
A FIRST COURSE ON 




Narosa


B.K. Kale


















Other Books of Interest

Algebra, Volume 1: Groups
I.S. Luthar and I.B.S. Passi

Analytical Geometry
D. Chatterjee

An Elementary Course in Partial Differential Equations
T. Amaranath

A First Course in Mathematical Analysis
D. Somasundaram and B. Choudhary

Foundations of Complex Analysis
S. Ponnusamy

Functional Analysis: Selected Topics
P.K. Jain

Industrial and Applied Mathematics
A.H. Siddiqi and K. Ahmad (eds)

An Introduction to Measure and Integration
I.K. Rana

Introduction to Rings and Modules (2nd Rev. Ed)
C. Musili

The Lost Notebook and Other Unpublished Papers with a Biography of Srinivasa Ramanujan
S. Ramanujan

Metric Spaces (Rev. Ed)
P.K. Jain and K. Ahmad

Nonlinear Functional Analysis
R. Akerkar

Probability
J. Pitman

Sample Survey Theory
Des Raj and P. Chandhok

Small Area Estimation in Survey Sampling
P. Mukhopadhyay

Forthcoming

A First Course on Probability
T.K. Chandra and D. Chatterjee

Mathematical Theory of Continuum Mechanics
R. Chatterjee

Mathematics and Statistics in Engineering and Technology
A. Chattopadhyay (ed)

Probability, Data Analysis and Statistics
G. Saporta

Sequence Spaces and Applications
P.K. Jain (ed)



A First Course on 

PARAMETRIC INFERENCE 


B.K. Kale 



Narosa Publishing House 

New Delhi Madras Bombay Calcutta 

London 



Dr. B.K. Kale 

Department of Statistics 
University of Pune 
Pune-411 007, INDIA 



Copyright © 1999 Narosa Publishing House, New Delhi 


NAROSA PUBLISHING HOUSE 

6 Community Centre, Panchsheel Park, New Delhi 110 017

22 Daryaganj, Prakash Deep, Delhi Medical Association Road, New Delhi 110 002 

35-36 Greams Road, Thousand Lights, Madras 600 006

306 Shiv Centre, D.B.C. Sector 17, K.U. Bazar P.O., New Mumbai 400 705 

2F-2G Shivam Chambers, 53 Syed Amir Ali Avenue, Calcutta 700 019 

3 Henrietta Street, Covent Garden, London WC2E 8LU, UK 

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system 
or transmitted in any form or by any means, electronic, mechanical, photocopying, recording 
or otherwise, without the prior permission of the publisher 


All export rights for this book vest exclusively with Narosa Publishing House 
Unauthorised export is a violation of Copyright Law and is subject to legal action. 


ISBN 81-7319-196-4 


Published by N.K. Mehra for Narosa Publishing House, 6 Community Centre,
Panchsheel Park, New Delhi 110 017 and printed at Replika Press Pvt Ltd,
Delhi 110 040 (India).






This fruitful tree of knowledge 
Which you yourself have planted 
Please nurture it with the nectar 
Of your full attention it deserves 


Saint Dnyaneshwar—1290 A.D. 




Preface


This book is intended to be used as a text for an introductory course on
model-based parametric inference at the senior undergraduate (honours) level
or in the first year of a graduate program. The course can be covered well in
one academic year, that is, two semesters or three quarters, as is normally
done in most U.S./Canadian and Indian universities, where it is regarded as a
core course in the Statistics Master's degree program. A major (principal) in
Statistics and a minor (subsidiary) in Mathematics at the Bachelor's level
would be adequate prerequisites for this text. It is therefore assumed that
the student knows single and multivariable calculus, elementary matrix
algebra, standard distribution theory and elementary forms of the Weak Law of
Large Numbers and the Central Limit Theorem.

Chapter 1 presents the historical perspective that led to the current
frequentist approach, in which the evaluation of the performance of any
inference procedure is based on its sampling distribution. Such a perspective
is necessarily subjective and other teachers may want to view the evolution of
ideas somewhat differently. However, it is necessary to present such a
historical perspective, emphasizing that most of the current inference
procedures have been developed to provide answers to important questions in
other sciences. Students and teachers should be aware of this evolution so
that they can respond effectively in a similar way and provide statistical
methods as needed by other scientists facing important questions in their
fields.

It is assumed here that statistical inference is an integral part of the
rational empiricism proposed by Descartes and Bacon, leading to deliberate
experimentation as an important tool to gain knowledge about the external
world around us. Thus the text starts with the concept of information in a
sample about the unknown parameter of interest and with the problem of
defining and obtaining a minimal sufficient statistic. Thereafter the text
follows more or less the well-trodden path, but also makes some detours which,
it is hoped, the students and the teachers will find interesting and useful.

I have taught the material covered here for several years and there are
many students and colleagues at Iowa State, Manitoba and Pune whose
contributions I wish to acknowledge, and I take this opportunity to thank
them. I thank the University Grants Commission of India for providing me a
grant under their scheme for writing university level textbooks during May
1994-May 1996 and subsequently an Emeritus Fellowship. I also wish to
express my thanks to the authorities of the University of Pune and the
Department of Statistics for providing me the necessary facilities to continue
my research and complete the text after my retirement in November 1993.

Last but not least, I must thank Mrs. A.V. Sabane for her careful typing
of the manuscript and M/s Narosa Publishing House, New Delhi for
subsequent processing and bringing out the text so nicely in its present
form.


Pune 

January 1999 


B.K. Kale 



Contents

Preface

1. Introduction
1.1 Basic Framework
1.2 A Historical Perspective-I
1.3 Historical Perspective-II

2. Sufficient Statistic
2.1 Motivation
2.2 Fisher Information
2.3 Defining Sufficient Statistic
2.4 Obtaining Sufficient Statistic
2.5 Exponential Family
2.6 Pitman Family

3. Minimum Variance Unbiased Estimation
3.1 Unbiasedness
3.2 Best Linear Unbiased Estimator (BLUE)
3.3 Cramer-Rao Inequality and Its Applications
3.4 Rao-Blackwell Theorem
3.5 Completeness and Lehmann-Scheffe Theorem
3.6 A Necessary and Sufficient Condition for MVUE

4. Simultaneous Estimation of Several Parameters
4.1 Matrix Optimality Criterion
4.2 Ellipsoid of Concentration
4.3 Klebanov-Linnik-Rukhin Theorem
4.4 Cramer-Rao Inequality
4.5 Other Optimality Criteria

5. Consistent Estimators
5.1 Consistency
5.2 Method of Moments
5.3 Method of Percentiles
5.4 Choosing Between Consistent Estimators

6. Consistent Asymptotically Normal Estimators
6.1 A Basic Result
6.2 Method of Moments and Percentiles
6.3 CAN Estimators Multiparameter Case
6.4 Method of Moments and Percentiles
6.5 Gauss-Legendre-Boscovitch Revisited

7. Method of Maximum Likelihood
7.1 Introduction
7.2 MLE in Exponential Family
7.3 Cramer Family
7.4 Cramer-Huzurbazar Theorem
7.5 Multinomial With Cell Probabilities Depending On a Parameter
7.6 Solution of Likelihood Equations
7.7 Asymptotically Most Efficient Estimator

8. Tests of Hypotheses-I
8.1 Historical Perspective
8.2 Critical Regions and Test Functions
8.3 Neyman-Pearson Lemma and the MP Test
8.4 Uniformly Most Powerful (UMP) Test
8.5 Non-existence of UMP Tests

9. Tests of Hypotheses-II
9.1 The Likelihood Ratio Test (LRT)
9.2 LRT for Multinomials
9.3 Large Sample Tests
9.4 Consistency of a Large Sample Test
9.5 Convex Combination of Two Types of Errors

10. Interval Estimation
10.1 Background
10.2 Shortest Expected Length Confidence Intervals
10.3 Large Sample Confidence Intervals
10.4 Unbiased Confidence Intervals
10.5 Bayesian and Fiducial Intervals

References

Index






1 

Introduction 


1.1 Basic Framework 

In the standard framework of parameter estimation one starts with the data
$(x_1, \ldots, x_n)$, which are observations on the characteristic $X$ of $n$
individual items. The data are regarded as a realization of random variables
(r.v.s) $(X_1, \ldots, X_n)$, which are assumed to be independent and
identically distributed (i.i.d.) as $X$, the r.v. whose probability law
describes the stochastic behaviour of the character $X$ in the population.
This is also described as having a random sample of size $n$ on $X$ with
probability distribution specified by a probability density function (pdf) in
the case of a continuous r.v. $X$ or by a probability mass function (pmf) in
the case of a discrete $X$. Henceforth we will use the word pdf, which would
also include pmf when $X$ is discrete. We further assume that the pdf has a
known functional form but involves an unknown parameter, real or vector
valued, $\theta = (\theta_1, \ldots, \theta_m)$, which is a labelling or
indexing parameter. That is, if we know the value of $\theta$ then the pdf of
$X$ is fully specified and the stochastic behaviour of the character $X$ in
the population is completely known. The character $X$ could be real or vector
valued. The labelling or indexing parameter $\theta$ varies over a set of
values called the parameter space, denoted by $\Omega$. The object of
inference therefore is the parameter $\theta$ or a function of the parameter,
say $\psi(\theta)$, which is of interest. We say that $\theta$ is an indexing
or a labelling parameter when the probability distribution of $X$ is uniquely
specified by $\theta$, that is, when $F(x, \theta_1) = F(x, \theta_2)$ for all
values of $x \in R_1$ implies that $\theta_1 = \theta_2$. We illustrate the
above set up by a few examples.


Example 1.1 Suppose that 100 seeds of brand A were planted, one in each
pot, and let $X_i$ equal one or zero according as the seed in the $i$-th pot
germinates or not. The data consist of $(x_1, \ldots, x_{100})$, a sequence of
ones and zeroes, and are regarded as a realization of
$(X_1, X_2, \ldots, X_{100})$ whose components are i.i.d. r.v.s with
$P[X_i = 1] = \theta$ and $P[X_i = 0] = 1 - \theta$, where $\theta$ represents
the probability that a seed germinates. The object of estimation is $\theta$
itself or a function $\psi(\theta)$ that may be of interest. For example,
consider

    $\psi_1(\theta) = \binom{10}{8} \theta^8 (1 - \theta)^2$,

which is the probability that in a batch of 10 seeds exactly 8 seeds will
germinate, or

    $\psi_2(\theta) = \sum_{r=15}^{20} \binom{20}{r} \theta^r (1 - \theta)^{20 - r}$,

which represents the probability that in a batch of 20 seeds at least 15 seeds
will germinate. Another function of interest is $\psi_3(\theta) = 1$ if
$\theta \geq 0.90$ and zero otherwise. This corresponds to the situation where
the brand would be recommended to the farmers provided the probability of
germination of a seed is at least 0.90 or, roughly speaking, on an average at
least 90% of the seeds sown would germinate.

It is worth pointing out that K. Pearson, the founding father of modern
Statistics, in Biometrika (1920) mentions that one of the fundamental problems
of Statistics is to estimate the probability of at least $r_2$ successes in
$n_2$ future trials, on the basis of the data that $r_1$ successes have been
observed in $n_1$ trials.
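The two functions of $\theta$ in Example 1.1 are plain binomial probabilities, so they are easy to evaluate numerically. The following is a minimal sketch; the function names `psi1` and `psi2` and the trial value `theta = 0.9` are illustrative choices, not part of the text.

```python
from math import comb

def psi1(theta):
    """Probability that exactly 8 of 10 seeds germinate."""
    return comb(10, 8) * theta**8 * (1 - theta)**2

def psi2(theta):
    """Probability that at least 15 of 20 seeds germinate."""
    return sum(comb(20, r) * theta**r * (1 - theta)**(20 - r)
               for r in range(15, 21))

theta = 0.9  # hypothetical germination probability, for illustration only
print(psi1(theta), psi2(theta))
```

Both quantities change with `theta`, which is why an estimate of $\theta$ immediately yields estimates of $\psi_1(\theta)$ and $\psi_2(\theta)$.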

One can easily show that $\theta$ is in fact a labelling parameter. Consider
$F(u, \theta_1) = F(u, \theta_2)$ for all $u \in R_1$. In particular take any
$u \in (0, 1)$, say $u = 1/2$. Then

    $F(1/2, \theta_1) = 1 - \theta_1 = F(1/2, \theta_2) = 1 - \theta_2$,

from which we conclude that $\theta_1 = \theta_2$.

Example 1.2 It is quite well known that the Poisson distribution with pmf

    $P[X = x] = e^{-\lambda} \lambda^x / x!$,  $x = 0, 1, 2, \ldots$

serves as a good model for the number of times a given event E occurs in a
unit time interval, e.g. the number of telephone calls received in a telephone
exchange or the number of $\alpha$-particles hitting a Geiger counter. Instead
of a time interval one could have other types of "intervals", e.g. $X$ could
be the number of breakages per 100 m of a yarn or $X_i$ could be the number of
bacteria of a specific type found in 1 cc of blood of the $i$-th patient. One
can easily show again that the parameter $\lambda$ is a labelling parameter.
To prove this consider $F(u, \lambda_1) = F(u, \lambda_2)$ for $0 < u < 1$;
then we have $e^{-\lambda_1} = e^{-\lambda_2}$, which implies that
$\lambda_1 = \lambda_2$.

In a pathbreaking experiment Rutherford, Chadwick and Ellis (1920)
observed 2608 time intervals of 7.5 seconds each and counted the number
of time intervals $N_r$ in which exactly $r$ $\alpha$-particles hit the
counter. They obtained the following table (W. Feller, 1957):

    r      0    1    2    3    4    5    6    7    8    9  ≥10
    N_r   57  203  383  525  532  408  273  139   45   27   16

Here $n = 2608$ and $(X_1, X_2, \ldots, X_{2608})$ are i.i.d. Poisson r.v.s
with mean $\lambda$, where $X_i$ is the number of $\alpha$-particles hitting
the counter in the $i$-th time interval. Note that the data above are
presented as a frequency table rather than as the sequence of individual
observations $(x_1, \ldots, x_{2608})$.

Example 1.3 Consider the determination of an ideal physical constant such as
gravity $g$. The usual way to estimate $g$ is by the pendulum experiment, in
which one observes $X = 4\pi^2 l / T^2$, where $l$ is the length of the
pendulum and $T$ the time required for a fixed number of oscillations. Due to
variation which depends on several factors such as the skill of the
experimenter and measurement errors, the $i$-th observation $X_i$ can be
represented as $X_i = g + \varepsilon_i$, where $\varepsilon_i$ is the random
error. Assuming the error is normal with zero mean and variance $\sigma^2$, we
have that $(X_1, \ldots, X_n)$ are i.i.d. $N(g, \sigma^2)$. Here the parameter
$\theta$ is a two dimensional vector, $\theta = (g, \sigma^2)'$. Is it a
labelling parameter? The answer is yes and we can use the following argument.
Let $F(x, g, \sigma^2)$ denote the d.f. of $N(g, \sigma^2)$. Then
$F(x, g_1, \sigma_1^2) = F(x, g_2, \sigma_2^2)$ for all $x \in R_1$ implies
that their characteristic functions are identical, i.e.

    $\exp\{ i g_1 t - t^2 \sigma_1^2 / 2 \} = \exp\{ i g_2 t - t^2 \sigma_2^2 / 2 \}$  for all real $t$,

or

    $i g_1 t - t^2 \sigma_1^2 / 2 = i g_2 t - t^2 \sigma_2^2 / 2$  for all $t$.

Equating real and imaginary parts we have $g_1 = g_2$ and
$\sigma_1^2 = \sigma_2^2$. If the student is not familiar with characteristic
functions one can use the moment generating function of $N(g, \sigma^2)$,
which is given by $\exp\{ g t + t^2 \sigma^2 / 2 \}$, to claim that

    $g_1 t + t^2 \sigma_1^2 / 2 = g_2 t + t^2 \sigma_2^2 / 2$  for all real $t$.

Taking the derivative at $t = 0$ we have $g_1 = g_2$ and then
$\sigma_1^2 = \sigma_2^2$. Here we can view estimation of $g$ as estimating
$\psi_1(g, \sigma^2) = g$. On the other hand one may be interested in
estimating the error variance $\sigma^2$, through which we can assess the
ability of the experimenter. Thus we may want to estimate
$\psi_2(g, \sigma^2) = \sigma^2$.


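Exercise 1.1 Consider the repeated measurement model
$X_i = \theta + \varepsilon_i$, where the $\varepsilon_i$ are i.i.d.
(a) uniform over $(-\sigma, \sigma)$, (b) double exponential with pdf
$f(\varepsilon, \sigma) = \frac{1}{2\sigma} e^{-|\varepsilon| / \sigma}$, or
(c) Cauchy with scale $\sigma$, with pdf
$f(\varepsilon, \sigma) = \frac{1}{\pi \sigma} \cdot \frac{1}{1 + \varepsilon^2 / \sigma^2}$.
Show that in each case the parameter $(\theta, \sigma)$ is a labelling
parameter.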


Exercise 1.2 Consider the indirect measurement problem, or standard regression
problem, where the response $y_i$ at level $d_i$ is given by
$y_i = \alpha + \beta d_i + \varepsilon_i$, $i = 1, 2, \ldots, n$, where the
$\varepsilon_i$ are i.i.d. $N(0, \sigma^2)$. Assuming that there are at least
two distinct levels, i.e. $d_i \neq d_j$ for some pair $(i, j)$, show that
$(\alpha, \beta, \sigma^2)$ is an indexing parameter.

We remark here that there are some practical situations in which the way
data arise leads to the problem of non-identifiability. Consider the classical
Signal plus Noise problem where the observable r.v. is
$X_i = Y_i + \varepsilon_i$, where $Y_i$, the "signal", is
$N(\theta, \sigma^2)$ and $\varepsilon_i$, the "noise", is $N(0, \sigma_0^2)$.
Then $(X_1, \ldots, X_n)$ are i.i.d. $N(\theta, \sigma^2 + \sigma_0^2)$ and
$(\theta, \sigma^2 + \sigma_0^2)$ is a labelling parameter whereas the
parameter $(\theta, \sigma^2, \sigma_0^2)$ is not. Similarly consider a series
system with two components $(Y, Z)$ such that $Y$, $Z$ are independent
exponentials with failure rates $\theta$ and $\lambda$ respectively. The
observable random variable is $X_i = \min(Y_i, Z_i)$, the failure time of the
$i$-th unit. Here $(X_1, \ldots, X_n)$ are i.i.d. exponentials with failure
rate $(\theta + \lambda)$. The parameter $(\theta, \lambda)$ is not a
labelling parameter since, say, $\theta = 1$, $\lambda = 3$ and $\theta = 2$,
$\lambda = 2$ would give rise to the same distribution of $X$. The problem of
non-identifiability arises quite often in many fields, particularly in
Econometrics. However, this being a first course, we will not consider the
problems of non-identifiability but instead concentrate on those situations in
which the data $(X_1, \ldots, X_n)$ arise from a probability distribution with
a real or vector valued labelling parameter $\theta$.
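The series-system example can be checked numerically: the survival function of the minimum of two independent exponentials is the product of the two survival functions, so only the sum of the rates matters. A minimal sketch follows; the helper name `survival_min` and the evaluation grid are assumptions made for illustration.

```python
from math import exp

def survival_min(x, theta, lam):
    """P(min(Y, Z) > x) for independent exponentials with rates theta and lam."""
    return exp(-theta * x) * exp(-lam * x)

# Evaluate both parameter pairs from the text on an arbitrary grid of x values.
grid = [0.1 * k for k in range(50)]
max_gap = max(abs(survival_min(x, 1.0, 3.0) - survival_min(x, 2.0, 2.0))
              for x in grid)
print(max_gap)  # differs from zero only by floating-point rounding
```

Since the two parameter pairs give the same distribution for the observable $X$, no amount of data on $X$ alone can distinguish between them.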

As mentioned earlier the object of obtaining data $(x_1, \ldots, x_n)$ is to
estimate $\theta$, which determines the stochastic behaviour of the
characteristic $X$ under study. For example, in the classical
Rutherford-Chadwick-Ellis experiment discussed in Ex. 1.2 the object is to
estimate the parameter $\lambda$ of the Poisson distribution which describes
the stochastic behaviour of the number of $\alpha$-particles that hit the
counter in a unit time interval of 7.5 seconds. On the other hand in Ex. 1.1
the object of the experiment is to estimate the parameter $\theta$, the
probability that a seed will germinate, on the basis of observations on a
Bernoulli series of trials $(X_1, \ldots, X_{100})$. In Ex. 1.1 an obvious
estimate of $\theta$ is the observed proportion of seeds that have germinated
out of the 100 which were planted. This estimate can be expressed as
$\hat{\theta} = \frac{1}{n} \sum x_i = \bar{x}$, the sample mean. In Ex. 1.2
we may suggest $\bar{x}$, the sample mean again, as the estimate of $\lambda$,
since the parameter $\lambda$ is also the population mean or expected value of
$X$. With similar logic we can propose $\bar{x}$ as an estimate of $g$ in
Ex. 1.3.
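As a rough illustration of the sample-mean estimate in Ex. 1.2, the frequency table can be turned into an estimate of $\lambda$. This is only a sketch under two stated assumptions: the open class "10 or more" is treated as exactly 10, which slightly understates the mean, and its count is taken as 16 so that the frequencies add up to the stated $n = 2608$.

```python
from math import exp, factorial

# Frequencies N_r from the table in Ex. 1.2; the key 10 stands for the
# open class "10 or more" (assumption: treated as exactly 10).
counts = {0: 57, 1: 203, 2: 383, 3: 525, 4: 532, 5: 408,
          6: 273, 7: 139, 8: 45, 9: 27, 10: 16}

n = sum(counts.values())                            # number of time intervals
lam_hat = sum(r * c for r, c in counts.items()) / n  # sample mean

# Expected frequencies under a Poisson model with mean lam_hat, for r = 0..9.
expected = {r: n * exp(-lam_hat) * lam_hat**r / factorial(r) for r in range(10)}

print(n, lam_hat)
for r in range(10):
    print(r, counts[r], round(expected[r], 1))
```

Comparing the observed and expected columns gives an informal impression of how well the Poisson model describes the data.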

Two points emerge immediately: the estimate is a well defined function of the
sample observations, and therefore the estimate may change from sample to
sample. For example in Ex. 1.1 the estimate $\hat{\theta} = \bar{x} = r/n$ for
all those sequences, $\binom{n}{r}$ out of $2^n$ sequences, in which $r$ seeds
have germinated out of the $n = 100$ seeds planted, but it would change as
$r$, the total number of seeds germinated, changes from sample to sample. Thus
an estimate is in fact a realization of a random variable taking the value
$T(x_1, \ldots, x_n)$ depending on the observed sample values
$(x_1, \ldots, x_n)$. To emphasize this aspect, we would now use the term
estimator, i.e. a rule which specifies the function $T(X_1, \ldots, X_n)$ for
all possible values of the random vector $(X_1, \ldots, X_n)$ varying over the
sample space. The estimate $T(x_1, \ldots, x_n)$ thus would be a realization
of $T(X_1, \ldots, X_n)$ for the observed sample $(x_1, \ldots, x_n)$. By
using the techniques of transformation or from basic principles we could, at
least theoretically, obtain the sampling distribution of the estimator $T$.
Thus in Ex. 1.1 the estimator
$T(X_1, \ldots, X_n) = \frac{1}{n} \sum X_i = \bar{X}$ would take $(n + 1)$
distinct values $(0, 1/n, \ldots, 1)$ and

    $P(\bar{X} = r/n) = P(\sum X_i = r) = \binom{n}{r} \theta^r (1 - \theta)^{n - r}$,  $r = 0, 1, 2, \ldots, n$.

Similarly, in the repeated measurement model with normal errors considered in
Ex. 1.3, the estimator $T(X_1, \ldots, X_n) = \bar{X}$ would itself be normal
with mean $g$ and variance $\sigma^2 / n$. We thus note that in both cases,
and in general, the sampling distribution of an estimator $T$ would also
depend on $\theta$. Thus to the basic framework in which
$(X_1, \ldots, X_n)$ are i.i.d. with pdf $f(x, \theta)$,
$\theta \in \Omega$, we now add the estimator $T(X_1, \ldots, X_n)$ with the
induced corresponding class of pdfs $\{g(t, \theta), \theta \in \Omega\}$. Of
course one can suggest different estimators for the same parameter $\theta$.
For example, in the normal error model one could suggest using the sample
median $X_{([n/2]+1)}$, the $([n/2] + 1)$-th component of the order statistic
of the sample or, in general,

    $\sum_{i=1}^{n} l_i X_{(i)}$,

where the $l_i$ are specified constants and $X_{(1)}, \ldots, X_{(n)}$ are the
order statistics of the sample. The weights $l_i$ are chosen in such a way
that extreme observations such as $X_{(1)}$, $X_{(n)}$ get less weight, even
zero, and the components of the order statistics around the median get higher
weight. Similarly, for estimating the variance $\sigma^2$ and the standard
deviation $\sigma$ in the normal model, apart from the classical estimator
$\hat{\sigma}^2 = \frac{1}{n} \sum (X_i - \bar{X})^2$ and its square root
$\{ \frac{1}{n} \sum (X_i - \bar{X})^2 \}^{1/2}$, many other estimators were
suggested and used. For example,
$\sqrt{\pi/2} \cdot \frac{1}{n} \sum |X_i - \bar{X}|$ was recommended by
Peters and $\hat{\sigma}^2$ was recommended by Bessel; one can also use a
multiple of the sample range, $(X_{(n)} - X_{(1)}) C_n$, as an estimate of
$\sigma$, the standard deviation, where $C_n$ is a suitable constant.
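The sampling distribution of $\bar{X}$ in the Bernoulli case can be tabulated exactly from the binomial formula above. The sketch below uses an illustrative value $\theta = 0.9$ (an assumption, not a value from the text) and checks that the distribution has mean $\theta$ and variance $\theta(1 - \theta)/n$.

```python
from math import comb

def sampling_dist(n, theta):
    """Map each possible value r/n of the sample mean to its probability."""
    return {r / n: comb(n, r) * theta**r * (1 - theta)**(n - r)
            for r in range(n + 1)}

dist = sampling_dist(100, 0.9)   # theta = 0.9 is a hypothetical value
total = sum(dist.values())
mean = sum(v * p for v, p in dist.items())
var = sum((v - mean) ** 2 * p for v, p in dist.items())
print(len(dist), total, mean, var)
```

Changing `theta` changes every probability in `dist`, which illustrates that the sampling distribution of an estimator depends on the unknown parameter.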


1.2 Historical Perspective-I

The problem of estimation arose in a very natural way in problems of
Astronomy and Geodesy in the first half of the 18th century. For example,
in Astronomy, the determination of interplanetary distances and of the
positions of planets and their movements in time were some of the
important problems. In Geodesy, determining the spheroidal shape of the
earth was one of the most important problems. It was known that the
figure of the earth is almost a sphere except for some flatness near the
poles. Observations were obtained on the length of one degree of a certain
meridian and the problem was to determine the parameters $\alpha$ and $\beta$
which specified the spheroid of the earth. Indirect observations on
$(\alpha, \beta)$ were given by the relation

    $y_i = \alpha + \beta x_i$,  $i = 1, 2, \ldots, n$

where the $x_i$ are known fixed constants. Note that $(\alpha, \beta)$ are
uniquely determined if only two observations on $Y$ at different values
$(x_1, x_2)$ are available. However, as is customary in science, several
observations were made at different values $(x_1, \ldots, x_n)$ and this led
to the theory of combination of observations with random error which directly
or indirectly measured "magnitudes of interest" or parameters.



Most students today know that the estimates of $\alpha$ and $\beta$ are
obtained by the Method of Least Squares, by minimizing

    $Q = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$,

where $y_i - \alpha - \beta x_i$ is called the error or residual. The method
of least squares was proposed by Gauss and Legendre, both very well known and
distinguished mathematicians of the late 18th and early 19th centuries. There
is some controversy regarding priority between them. Legendre published the
method in 1805, studying it in great detail, but it appears that Gauss had
been using it since 1796. This is not the place to go into further details of
this controversy, but we note that about half a century earlier Boscovitch
(1757) had proposed a solution to this problem using a different approach.

Boscovitch suggested that the estimates of $\alpha$ and $\beta$ be determined
such that

(i) The sum of positive and negative residuals or errors should balance,
i.e. $\sum (y_i - \alpha - \beta x_i) = 0$, and

(ii) Subject to the above constraint we determine $(\alpha, \beta)$ such that
$R = \sum_{i=1}^{n} |y_i - \alpha - \beta x_i|$, the sum of absolute values
of errors, is as small as possible.

Using a geometric argument Boscovitch solved the problem for the five
observations that he had. Laplace (1789) gave a general algebraic algorithm
to obtain estimates of $(\alpha, \beta)$ on the above principles for any
number of observations. It may be pointed out that Boscovitch's method is the
first instance of solving an optimization problem with a constraint. Laplace
(1789) in fact preferred Boscovitch's method over that of Cotes (1722), which
was then in use.
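Conditions (i) and (ii) can be turned into a short computation. Condition (i) forces the fitted line through $(\bar{x}, \bar{y})$, so only the slope is free, and the sum of absolute residuals then becomes $\sum w_i |s_i - \beta|$ with weights $w_i = |x_i - \bar{x}|$ and slopes $s_i = (y_i - \bar{y}) / (x_i - \bar{x})$, which is minimized by a weighted median of the $s_i$. The sketch below implements this weighted-median idea next to ordinary least squares; the function names and the five data points are hypothetical and are not Boscovitch's observations.

```python
def least_squares(xs, ys):
    """Return (alpha, beta) minimizing the sum of squared residuals."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    beta = sxy / sxx
    return ybar - beta * xbar, beta

def boscovitch(xs, ys):
    """Return (alpha, beta) satisfying conditions (i) and (ii)."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    # Slopes from the centroid to each point, weighted by |x_i - xbar|.
    pairs = sorted(((y - ybar) / (x - xbar), abs(x - xbar))
                   for x, y in zip(xs, ys) if x != xbar)
    half = sum(w for _, w in pairs) / 2
    acc = 0.0
    for slope, w in pairs:          # weighted median of the slopes
        acc += w
        if acc >= half:
            beta = slope
            break
    return ybar - beta * xbar, beta

def abs_residual_sum(xs, ys, alpha, beta):
    return sum(abs(y - alpha - beta * x) for x, y in zip(xs, ys))

xs = [0.0, 1.0, 2.0, 3.0, 4.0]      # hypothetical data with one outlier
ys = [1.0, 3.0, 5.0, 7.0, 19.0]
a_ls, b_ls = least_squares(xs, ys)
a_b, b_b = boscovitch(xs, ys)
print((a_ls, b_ls), (a_b, b_b))
```

On data lying exactly on a straight line both methods return that line; with an outlier present the two fits generally differ, and the Boscovitch fit never has a larger sum of absolute residuals than the least squares fit.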

Boscovitch's contribution to the solution of the problem of "determination
of magnitudes of interest", or the estimation of unknown parameters, is a
fundamental contribution in that it prescribed the method of estimation
using some basic principles. Boscovitch suggested fitting the straight
line $y = \alpha + \beta x$ to the data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$,
requiring that the fitted line should pass through the point
$(\bar{x}, \bar{y})$ and that the sum of absolute deviations of the points
$(x_i, y_i)$ from the fitted line in the direction of the y-axis be minimized.
Later, in the method of Least Squares, Legendre and Gauss proposed fitting
the straight line using the principle that the sum of squares of deviations
in the direction of the y-axis be minimized. Note that the line fitted by the
method of Least Squares also passes through the point $(\bar{x}, \bar{y})$.

In the same spirit, later in the 19th century K. Pearson proposed the method
of moments and Fisher (1912) advocated the method of Maximum Likelihood.
With several methods of estimation being proposed on different and possibly
equally appealing grounds, there arose the inevitable question of comparing
estimates obtained by using different methods.

Prompted by a remark of Eddington in his book "Stellar Movements" that

    $\hat{\sigma}_1 = \sqrt{\pi/2} \cdot \frac{1}{n} \sum |X_i - \bar{X}|$

has some theoretical advantage over

    $\hat{\sigma}_2 = \{ \frac{1}{n} \sum (X_i - \bar{X})^2 \}^{1/2}$,

Fisher (1920) decided to investigate the performance of these two estimators
of $\sigma$ and laid down the basic principles as to how this comparison
should be made. Fisher first obtained the sampling distributions of
$\hat{\sigma}_1$ and $\hat{\sigma}_2$, their means and variances and their
mean squared errors (MSE). Comparing MSE($\hat{\sigma}_1$) and
MSE($\hat{\sigma}_2$) he showed that the sampling distribution of
$\hat{\sigma}_2$ is more concentrated around $\sigma$ than that of
$\hat{\sigma}_1$, assuming that errors are normally distributed. Thus Fisher
laid down the foundation of the present practice that the performance of an
estimator has to be judged on the basis of the sampling distribution of the
estimator and a few characteristics of such a sampling distribution such as
the mean, the variance or bias and the mean squared error.
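Fisher's comparison was analytical, but its flavour can be reproduced by simulation. The following Monte Carlo sketch is not Fisher's derivation; the sample size, number of replications and seed are arbitrary choices made for illustration. It estimates the MSE of the mean-deviation estimator and of the root-mean-square estimator under normal errors.

```python
import random
from math import pi, sqrt

def sigma1(xs):
    """sqrt(pi/2) times the mean absolute deviation about the sample mean."""
    n = len(xs)
    xbar = sum(xs) / n
    return sqrt(pi / 2) * sum(abs(x - xbar) for x in xs) / n

def sigma2(xs):
    """Root of the mean squared deviation about the sample mean."""
    n = len(xs)
    xbar = sum(xs) / n
    return sqrt(sum((x - xbar) ** 2 for x in xs) / n)

def mse_pair(n=50, sigma=1.0, reps=4000, seed=1):
    """Monte Carlo estimates of MSE(sigma1) and MSE(sigma2) for normal samples."""
    rng = random.Random(seed)
    se1 = se2 = 0.0
    for _ in range(reps):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        se1 += (sigma1(xs) - sigma) ** 2
        se2 += (sigma2(xs) - sigma) ** 2
    return se1 / reps, se2 / reps

mse1, mse2 = mse_pair()
print(mse1, mse2)
```

Under the normal model the second value should come out smaller, in line with Fisher's conclusion; the ordering need not hold for other error distributions.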

The concept of the sampling distribution is a relatively new concept and
the standard sampling distributions such as $\chi^2$, $t$, $F$, and those of
the sample mean, variance, multiple and partial correlation coefficients etc.
were derived during the second quarter of the 20th century. Now many of these
derivations are taught at the final year undergraduate level. Such sampling
distributions can be obtained when one assumes that $(X_1, \ldots, X_n)$ has a
given joint probability distribution. Once such a model for the data is
assumed then, using the calculus of probability and standard techniques such
as transformations, one can at least theoretically obtain the distribution of
an estimator. The distributions such as $\chi^2$, $t$, $F$ etc. were obtained
by assuming that the observations constitute a random sample from a normal
distribution.

Primarily under the leadership of Fisher during the decade 1920-30,
foundations were laid down for the theory of estimation. Indeed Fisher
(1922), in his pathbreaking paper entitled "On the Mathematical Foundations
of Theoretical Statistics", listed the basic problems of theoretical
statistics as follows:

(1) Problems of Specification: Defining the distribution of the population.

(2) Problems of Estimation: Obtaining from the sample the statistics or
estimates of the parameters of the population.

(3) Problems of Distribution: Obtaining the sampling distribution of
statistics derived from the sample.

Following Fisher (1922) we adopt the attitude that the problems of
specification have been satisfactorily resolved by the scientists and
practising statisticians in the field, and would thus assume that the
stochastic behaviour of $X$ is specified by the class of pdfs
$\{f(x, \theta), \theta \in \Omega\}$, where $f$ is known. We will also assume
that the problems of distribution have been solved and thus concentrate on the
problem of estimation. Following Fisher we will compare the performance of
various estimators on the basis of their sampling distributions under the
assumed model, namely that $(X_1, \ldots, X_n)$ are independent identically
distributed random variables each with pdf belonging to the class
$\{f(x, \theta), \theta \in \Omega\}$ and that an estimator $T$ has the
induced probability distribution given by the corresponding class of pdfs
$\{g(t, \theta), \theta \in \Omega\}$.



The following Section 1.3 presents a brief historical review of some of the
models that arose in a natural way.

1.3 Historical Perspective—II 

Recall that in the simplest model, where Observation = True value + Error,
the errors of overestimation and underestimation must balance out (Boscovitch
assumption). Simpson (1776) translated this idea by assuming that errors
are symmetrically and uniformly distributed about zero, so that the model is
given by $X_i = \theta + \varepsilon_i$ where the pdf of the error is
$f(\varepsilon) = 1/(2h)$, $-h < \varepsilon < h$. Euler (1778) proposed the
arc of a parabolic curve given by

    $f(\varepsilon) = \frac{3}{4 r^3} (r^2 - \varepsilon^2)$,  $-r < \varepsilon < r$

as the pdf of the random error. Euler seems to be the first one to propose a
distribution of errors with the basic property that large errors, i.e. large
values of $|\varepsilon|$, are less likely than small values of
$|\varepsilon|$. Laplace proposed the pdf

    $f(\varepsilon) = \frac{1}{2h} \exp\{ -|\varepsilon| / h \}$,  $-\infty < \varepsilon < \infty$

as the model for the distribution of errors and Gauss proposed the normal
distribution with pdf

    $f(\varepsilon) = \frac{1}{\sqrt{2 \pi h^2}} \exp\{ -\varepsilon^2 / 2h^2 \}$,  $-\infty < \varepsilon < \infty$.

It is important to point out here that the double exponential distribution
used by Laplace to represent the error distribution led to the median of the
sample as the "best" estimator of the "True value", whereas the normal
distribution used by Gauss led to the mean of the sample as the "best"
estimator of the "True value".
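The link between the error model and the "best" estimator can be seen numerically. Under Laplace's model, maximizing the likelihood in the true value amounts to minimizing $\sum |x_i - \theta|$, and under Gauss's model to minimizing $\sum (x_i - \theta)^2$. The sketch below does both by a crude grid search; the five sample values and the grid are hypothetical choices for illustration.

```python
from statistics import mean, median

sample = [2.1, 3.4, 1.9, 7.5, 2.8]          # hypothetical measurements
grid = [k / 100 for k in range(0, 1001)]     # candidate true values 0.00..10.00

def sum_abs(theta):
    """Criterion implied by the double exponential (Laplace) error model."""
    return sum(abs(x - theta) for x in sample)

def sum_sq(theta):
    """Criterion implied by the normal (Gauss) error model."""
    return sum((x - theta) ** 2 for x in sample)

laplace_hat = min(grid, key=sum_abs)
gauss_hat = min(grid, key=sum_sq)
print(laplace_hat, median(sample))
print(gauss_hat, mean(sample))
```

Up to the resolution of the grid, the first minimizer agrees with the sample median and the second with the sample mean.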

The popularity of the normal error model was mainly due to the ease
with which the Least Squares Method could be applied, as opposed to the method
of least absolute deviations originally proposed by Boscovitch and followed
by Laplace. Further, the normal distribution arose also in the context of
approximating the binomial probability of $r$ successes in $n$ trials, given
by the De Moivre-Laplace limit theorem, namely

    $\binom{n}{r} p^r (1 - p)^{n - r} \approx \frac{1}{\sqrt{2 \pi n p (1 - p)}} \exp\left\{ -\frac{(r - np)^2}{2 n p (1 - p)} \right\}$.
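The quality of this approximation is easy to inspect numerically. The sketch below compares the exact binomial probability with the normal approximation for a few values of $r$; the choices $n = 100$ and $p = 0.5$ are illustrative assumptions.

```python
from math import comb, exp, pi, sqrt

def binom_pmf(n, r, p):
    """Exact probability of r successes in n Bernoulli trials."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def normal_approx(n, r, p):
    """De Moivre-Laplace approximation to the same probability."""
    v = n * p * (1 - p)
    return exp(-(r - n * p) ** 2 / (2 * v)) / sqrt(2 * pi * v)

n, p = 100, 0.5                    # illustrative values
for r in (40, 45, 50, 55, 60):
    print(r, binom_pmf(n, r, p), normal_approx(n, r, p))
```

The agreement is closest near $r = np$ and deteriorates in the tails or when $p$ is close to 0 or 1.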


The binomial distribution as a model of the occurrence of an event E in a
series of trials arose from games of chance, but it was also due to the
interest in the sex ratio, male to female births, in newborn babies. The
Bernoulli series of trials led to the Poisson distribution to model the
situation when $n$, the number of trials, is large but $p$, the probability of
the event, is very small. Other problems related to the Bernoulli series of
trials led to such other well known distributions as the geometric and
negative binomial, among others. Now the Poisson, geometric and negative
binomial are generally asymmetric discrete distributions and differ very much
from the symmetric error distribution models evolved earlier. Further,
K. Pearson observed that many data sets collected by him and other
statisticians in connection with biometrical measurements indicated that the
characteristic of interest ($X$) would have a distribution which could be skew
on either side and very far from normal. Gaining experience from such
collected data sets, Pearson proposed a very wide class of distributions now
known as the Pearsonian system of curves. As was customary then in applied
mathematics, the pdf was defined by a differential equation

    $\frac{d f(x)}{dx} = \frac{(x - a) f(x)}{b_0 + b_1 x + b_2 x^2}$

where $(a, b_0, b_1, b_2)$ are constants which determine the nature and shape
of the pdf. Depending on the nature of the roots of the quadratic in the
denominator, the equation leads to a variety of pdfs [Kendall and Stuart,
Vol. I (1958)]. Assuming that the first four moments exist, Pearson showed
that the constants $(a, b_0, b_1, b_2)$ are uniquely determined by the first
four moments of the pdf or equivalently by the mean, variance, skewness and
kurtosis. Since the distribution is uniquely determined by the first four
moments, fitting of the frequency curve (pdf) by the method of moments became
a standard exercise, which was further facilitated by the detailed technique
prescribed in Biometrika Tables for Statisticians (1954) prepared by Pearson
and his colleagues and included in the text by Elderton (1954). However, in
this transition the "magnitudes of interest" or the parameters of the
distribution were defined indirectly in terms of moments of the r.v. $X$.

The next major advance was made by Fisher (1912) when he presented 
the method of maximum likelihood for determining the parameters occurring 
in the distribution of X given by the pdf $f(x, \theta)$, in which the function $f$ is 
known and the constants $\theta = (\theta_1, \ldots, \theta_m)$ are determined on the basis of a 
random sample of size n on X with joint pdf $L(x, \theta) = \prod_{i=1}^{n} f(x_i, \theta)$. The 
function $L(x, \theta) = \prod_{i=1}^{n} f(x_i, \theta)$, for fixed $x = (x_1, \ldots, x_n)$ and variations in $\theta$, 
was defined by Fisher as the likelihood function. Note that for a discrete r.v. 
X, $L(x, \theta)$ is the probability of observing X = x for a given value of $\theta$. As 
$\theta$ varies over the set of possible values $\Omega$, one can establish an order 
relationship among the values of $\theta$. We say that $\theta_1$ is more likely than $\theta_2$ 
if $L(x, \theta_1) > L(x, \theta_2)$, and $\theta_2$ is more likely than $\theta_1$ if $L(x, \theta_1) < L(x, \theta_2)$. 
If $L(x, \theta_1) = L(x, \theta_2)$ then, on the basis of the data X = x, $\theta_1$ and $\theta_2$ are 
equally likely. In his method of maximum likelihood, Fisher proposed to 
estimate $\theta$ for a given value of X = x by that value of $\theta$ for which $L(x, \theta)$ 
is maximum. 
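
The likelihood ordering and the maximum likelihood rule can be illustrated with a small sketch. It assumes a Poisson model with mean $\theta$ (a distribution mentioned earlier in this chapter); the sample values, the two candidate values of $\theta$ and the search grid are hypothetical choices made only for illustration, and the grid search is a crude stand-in for exact maximisation.

```python
from math import exp, factorial

def poisson_likelihood(sample, theta):
    """L(x, theta) = prod_i exp(-theta) * theta**x_i / x_i! for a fixed sample x."""
    value = 1.0
    for x in sample:
        value *= exp(-theta) * theta**x / factorial(x)
    return value

sample = [2, 0, 3, 1]  # hypothetical observed counts, for illustration only

# Order two candidate values of theta by their likelihood for the fixed sample.
l_a = poisson_likelihood(sample, 1.5)
l_b = poisson_likelihood(sample, 3.0)
more_likely = 1.5 if l_a > l_b else 3.0

# Crude maximum likelihood estimate: the grid point with the largest likelihood.
grid = [k / 100 for k in range(1, 501)]  # assumed search range 0.01, ..., 5.00
theta_hat = max(grid, key=lambda theta: poisson_likelihood(sample, theta))
print(more_likely, theta_hat)
```

For the Poisson model the exact maximiser is the sample mean, so the grid estimate should land on the grid point closest to it.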

In all these approaches to estimation, it is assumed that it is possible to 
suggest an estimator of $\theta$ by making observations on X. This implies that 
we assume that observations on X are somehow providing information on 
$\theta$. One can ask the question as to why we feel that observations provide 
some information about the unknown parameter. Note that the probability 
distribution changes as $\theta$ changes due to our assumption of $\theta$ being an 
indexing or labelling parameter. Therefore the likelihood function $L(x, \theta)$ 
changes as $\theta$ varies over $\Omega$ and it is this change in the likelihood function 
that provides information about $\theta$. If $L(x, \theta)$ were to remain constant for an 
observed x then such an observation x would be regarded as not providing 
any information about $\theta$. At the base of the above thinking is the assumption 
that if the probability distribution does not change with the parameter $\theta$ 
then observations from such a distribution would not provide any information 
about $\theta$. Similarly, if T(x) is a statistic then its probability distribution 
must effectively depend on $\theta$ if T is to provide information about $\theta$. If the 
probability distribution of T does not depend on $\theta$ then it cannot provide 
any information about $\theta$. For example, for a sample of size two from 
$N(\theta, 1)$, $T_1 = X_1 + X_2$ provides some information about $\theta$ since $T_1 \sim N(2\theta, 2)$, 
but $T_2 = X_1 - X_2$ will not provide any information about $\theta$ as $T_2 \sim N(0, 2)$. 

Chapter 2 will consider information in detail and study the concept of 
sufficiency of a statistic T which plays a vital role in Statistical Inference. 



2 

Sufficient Statistic 


2.1 Motivation 

Let $(X_1, X_2, \ldots, X_n)$ be i.i.d. r.v.s with pdf belonging to the class $\{f(x, \theta), \theta \in \Omega\}$. 
Let $T(X_1, \ldots, X_n)$ be a statistic with corresponding pdf belonging to the 
class $\{g(t, \theta), \theta \in \Omega\}$. Following Fisher (1925) we call T a sufficient or 
an exhaustive statistic if it contains all the information about $\theta$ that is 
contained in the sample $(X_1, X_2, \ldots, X_n)$. Note that so far we have not 
quantified the amount of information either in the sample $(X_1, \ldots, X_n)$ or 
in a statistic T. However, as observed earlier, we have assumed that if a 
sample has its pdf not depending on the parameter $\theta$ then it does not 
contain any information about $\theta$. Using this logic consider any other statistic 
$T_1$ and consider the conditional distribution of $T_1$ given T with pdf $h(t_1, \theta \mid t)$. 
If this does not depend on $\theta$ then the conditional distribution of $T_1$ given 
T does not contain any information about $\theta$. If the statistic T is such that the 
above holds for every other statistic $T_1$ and for all possible values T = t then 
Fisher defined T to be sufficient or exhaustive. Note that in this case the 
conditional distribution of any other statistic $T_1$ given T is independent of 
$\theta$ and therefore does not provide any information additional to that which 
is already contained in T. On the other hand if T is such that the conditional 
distribution of a statistic $T_1$ given T changes with $\theta$ then there is some 
information about $\theta$ in the conditional distribution of $T_1$ given T and T is 
not sufficient or exhaustive. Fisher (1920) noted this phenomenon first in 
connection with the comparison of the estimators $\hat\sigma_1$ and $\hat\sigma_2$ of the standard 
deviation of the $N(\mu, \sigma^2)$ model mentioned in Chapter 1. He observed that 
the marginal pdf of $\hat\sigma_1$ as well as that of $\hat\sigma_2$ both depend on $\sigma$ and as such 
contain information about $\sigma$. But whereas the conditional distribution of 
$\hat\sigma_1$ given $\hat\sigma_2$ does not depend on $\sigma$, the conditional distribution of $\hat\sigma_2$ 
given $\hat\sigma_1$ does depend on $\sigma$. Fisher interpreted this as follows: having observed $\hat\sigma_2$, 
no additional information about $\sigma$ can be obtained by considering $\hat\sigma_1$. On 
the other hand if we observe $\hat\sigma_1$, the statistic $\hat\sigma_2$ still provides some additional 
information about $\sigma$. One must appreciate the genius of Fisher who, from 
a single specific example, formulated the property of sufficiency which 
later proved to be a fundamental concept in the theory of Statistical Inference. 
Fisher's initial work about $\hat\sigma_1$, $\hat\sigma_2$ is quite complex. We consider the following 
example which involves fairly simple calculations and illustrates the 
phenomenon discussed above. 



Example 2.1.1 Let $(X_1, X_2)$ be i.i.d. $N(\theta, 1)$. Now by standard results 
$(X_1, X_1 + X_2)'$ is bivariate normal (BVN) with mean vector $(\theta, 2\theta)'$ and covariance matrix 

$$\begin{pmatrix} 1 & 1 \\ 1 & 2 \end{pmatrix}$$

and the conditional distribution of $X_1$ given $X_1 + X_2 = t$ is $N(t/2, 1/2)$, which does not depend on $\theta$. 
Thus, once $X_1 + X_2$ is given, $X_1$ does not provide any additional information 
about $\theta$. On the other hand the conditional distribution of $(X_1 + X_2)$ given 
$X_1 = x_1$ is normal with mean $(x_1 + \theta)$ and variance one. Thus the conditional 
distribution of $(X_1 + X_2)$ given $X_1$ has some information about $\theta$. In general 
consider $T_1 = lX_1 + mX_2$ and $T = (X_1 + X_2)$. Then $(T_1, T)'$ is BVN with 
mean $(\theta(l + m), 2\theta)'$ and covariance matrix 

$$\begin{pmatrix} l^2 + m^2 & l + m \\ l + m & 2 \end{pmatrix}.$$

Then the conditional distribution of $T_1$ given T = t is normal with mean $\frac{(l + m)t}{2}$ and 
variance $\frac{(l - m)^2}{2}$. Therefore, the conditional distribution of $lX_1 + mX_2$ given 
$X_1 + X_2 = t$ does not contain any information about $\theta$. On the other hand the 
conditional distribution of T for $T_1 = t_1$ is normal with mean 

$$2\theta + \frac{(l + m)}{(l^2 + m^2)}\,[t_1 - (l + m)\theta]$$

and variance $\frac{(l - m)^2}{(l^2 + m^2)}$. Thus the distribution of T given $T_1 = t_1$ 
depends on $\theta$ and therefore contains some information about $\theta$. Note that the 
conditional distribution of $(X_1 + X_2)$ for fixed $lX_1 + mX_2 = t_1$ is independent 
of $\theta$ if and only if $2 - (l + m)^2/(l^2 + m^2) = 0$, or $l = m = k$ say, i.e. $lX_1 + mX_2$ 
is a constant multiple of $(X_1 + X_2)$, and therefore fixing $lX_1 + mX_2 = t_1$ fixes 
$X_1 + X_2$ uniquely to be equal to $t_1/k$, and the conditional distribution of 
$(X_1 + X_2)$ is a singular distribution with entire probability mass concentrated 
at the point $t_1/k$. 
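
The conditional means and variances quoted in Example 2.1.1 follow from the usual bivariate normal conditioning formula. The sketch below evaluates that formula for $T_1 = lX_1 + mX_2$ and $T = X_1 + X_2$; the function names and the particular values of $l$, $m$, $\theta$ and $t$ are illustrative assumptions, not part of the text.

```python
def conditional_normal(mean_a, mean_b, var_a, var_b, cov_ab, b):
    """Mean and variance of A given B = b when (A, B) is bivariate normal."""
    slope = cov_ab / var_b
    return mean_a + slope * (b - mean_b), var_a - cov_ab**2 / var_b

def t1_given_t(l, m, theta, t):
    # T1 = l*X1 + m*X2 and T = X1 + X2, with X1, X2 i.i.d. N(theta, 1)
    return conditional_normal((l + m) * theta, 2 * theta,
                              l * l + m * m, 2.0, l + m, t)

def t_given_t1(l, m, theta, t1):
    # Same pair, conditioning the other way round
    return conditional_normal(2 * theta, (l + m) * theta,
                              2.0, l * l + m * m, l + m, t1)

# Hypothetical values l = 1, m = 3: T1 | T is free of theta, T | T1 is not.
for theta in (0.0, 5.0):
    print(theta, t1_given_t(1.0, 3.0, theta, 2.0), t_given_t1(1.0, 3.0, theta, 1.0))
```

With $l = 1$, $m = 3$ the first pair stays at mean $(l+m)t/2$ and variance $(l-m)^2/2$ for every $\theta$, while the mean of the second pair moves with $\theta$.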

The above analysis shows that $lX_1 + mX_2$ where $l \neq m$ cannot be sufficient, 
as there exists a statistic $(X_1 + X_2)$ such that its conditional distribution, 
given $lX_1 + mX_2 = t_1$, depends on $\theta$. Although the conditional distribution of 
$lX_1 + mX_2$ given $X_1 + X_2 = t$ is independent of $\theta$ for any choice of $(l, m)'$, 
this does not prove the sufficiency of $(X_1 + X_2)$. To prove sufficiency of 
$T = (X_1 + X_2)$ we must show that for any statistic $T_2(X_1, X_2)$ the conditional 
distribution of $T_2$ given $T = (X_1 + X_2) = t$ is independent of $\theta$. Let $T_2 = 
X_1^2 + X_2^2$; then for $X_1 + X_2 = t$ we have $X_2 = t - X_1$ and for fixed t, $T_2 = 
X_1^2 + (t - X_1)^2 = 2X_1^2 - 2tX_1 + t^2 = 2(X_1 - t/2)^2 + t^2/2$. Now the conditional 
distribution of $X_1$ given $X_1 + X_2 = t$ is $N(t/2, 1/2)$ and therefore $2(X_1 - t/2)^2$ 
is $\chi^2_1$, and as $t^2/2$ is fixed the conditional distribution of $T_2$ given $(X_1 + X_2) = t$ 
is independent of $\theta$. This exercise shows how we can prove the sufficiency 
of $T = (X_1 + X_2)$. Consider any statistic $T_3(X_1, X_2)$; then for fixed T = t, we have 
$T_3(X_1, X_2) = T_3(X_1, t - X_1)$, which is purely a function of $X_1$. Since the conditional distribution 
of $X_1$ given $X_1 + X_2 = t$ is $N\left(\frac{t}{2}, \frac{1}{2}\right)$, the distribution of $U(X_1) = T_3(X_1, t - X_1)$ 
is independent of $\theta$, and thus $T = (X_1 + X_2)$ is sufficient. 

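Example 2.1.2 Let $(X_1, \ldots, X_n)$ be i.i.d. Poisson with mean $\theta$ and let 
$T = \sum_{i=1}^{n} X_i$. Then we know that T is Poisson with mean $n\theta$ and 

$$\begin{aligned}
P[X = x \mid T = t] &= P(X_1 = x_1, \ldots, X_n = x_n \mid T = t) \\
&= \frac{P\left(X_1 = x_1, \ldots, X_n = x_n, \sum_{i=1}^{n} X_i = t\right)}{P\left(\sum_{i=1}^{n} X_i = t\right)} \\
&= \frac{P\left(X_1 = x_1, \ldots, X_{n-1} = x_{n-1}, X_n = t - \sum_{i=1}^{n-1} x_i\right)}{P\left(\sum_{i=1}^{n} X_i = t\right)} \\
&= \left[\prod_{i=1}^{n-1} \frac{e^{-\theta}\theta^{x_i}}{x_i!}\right] \frac{e^{-\theta}\,\theta^{\,t - \sum_{i=1}^{n-1} x_i}}{\left(t - \sum_{i=1}^{n-1} x_i\right)!} \Bigg/ \frac{e^{-n\theta}(n\theta)^t}{t!} \\
&= \frac{t!}{x_1!\, x_2! \cdots \left(t - \sum_{i=1}^{n-1} x_i\right)!} \left(\frac{1}{n}\right)^t \\
&= \frac{t!}{\prod_{i=1}^{n} x_i!}\, \frac{1}{n^t} \quad \text{for } x_i \geq 0,\ \sum_{i=1}^{n} x_i = t
\end{aligned}$$

= 0 otherwise. 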
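
The cancellation of $\theta$ in this ratio can also be checked numerically. The sketch below computes the conditional probability directly from the definition for several values of $\theta$ and compares it with the multinomial expression $t!/(\prod x_i!\, n^t)$; the sample point and the values of $\theta$ are hypothetical.

```python
from math import exp, factorial

def poisson_pmf(k, mean):
    return exp(-mean) * mean**k / factorial(k)

def conditional_prob(xs, theta):
    """P[X = xs | sum(X) = t] for i.i.d. Poisson(theta), from the definition."""
    n, t = len(xs), sum(xs)
    joint = 1.0
    for x in xs:
        joint *= poisson_pmf(x, theta)
    # T = sum(X) is Poisson with mean n * theta
    return joint / poisson_pmf(t, n * theta)

def multinomial_prob(xs):
    """t! / (x_1! ... x_n!) * (1/n)**t, the equiprobable multinomial pmf."""
    n, t = len(xs), sum(xs)
    denom = 1
    for x in xs:
        denom *= factorial(x)
    return factorial(t) / denom / n**t

xs = (2, 0, 3, 1)  # hypothetical sample point with t = 6, n = 4
probs = [conditional_prob(xs, theta) for theta in (0.5, 1.0, 4.0)]
target = multinomial_prob(xs)
print(probs, target)
```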

This shows that the conditional distribution of $(X_1, X_2, \ldots, X_n)$ given 
$\sum_{i=1}^{n} X_i = t$ is a multinomial distribution in n equiprobable cells and frequencies 
$(X_1, \ldots, X_n)'$ such that $\sum_{i=1}^{n} X_i = t$. Since this conditional distribution on the 
sample space of $(X_1, \ldots, X_n)$ given $\sum X_i = t$ does not depend on $\theta$, it follows 
that for any subset A of the sample space, $P_\theta(A \mid A_t)$, where $A_t = \{x \mid \sum x_i = t\}$, 
does not depend on $\theta$, as $P_\theta(A \mid A_t) = \sum_{x \in A \cap A_t} P[X = x \mid T = t]$. Both the 
examples point out that the constraint $\sum_{i=1}^{n} X_i = t$ determines one variable, 
say $X_n$, in terms of the remaining (n - 1) variables $(X_1, X_2, \ldots, X_{n-1})$, and the 
conditional distribution of $(X_1, X_2, \ldots, X_{n-1})$ given T = t does not depend on $\theta$. 

The above idea can be generalized to a vector valued statistic $T = (T_1, \ldots, T_k)'$. 
Fixing T = t leaves (n - k) variables, say $(X_1, X_2, \ldots, X_{n-k})$, free, and the remaining k variables can 
be expressed as functions of $(X_1, X_2, \ldots, X_{n-k})$ and t. Any real or vector 
valued statistic $U(X_1, \ldots, X_n)$ given T = t can then be expressed as a function of 
$(X_1, \ldots, X_{n-k})$ and one can work out the conditional distribution of $U(X_1, \ldots, 
X_n)$ given T = t from the conditional distribution of $(X_1, \ldots, X_{n-k})$ given 
T = t. If this distribution is independent of $\theta$ for each t, it follows that the 
distribution of any statistic $U(X_1, \ldots, X_n)$ given T = t would be independent 
of $\theta$ and T would be sufficient. 

Remark 2.1.1 Note that for any two statistics $T_1$, T with joint pdf $g(t, t_1, \theta)$ 
the following factorization holds 

$$g(t, t_1, \theta) = g_0(t, \theta) \cdot g_1(t_1, \theta \mid t) \tag{2.1.1}$$

where $g_0(t, \theta)$ is the marginal pdf of T and $g_1(t_1, \theta \mid t)$ is the conditional 
pdf of $T_1$ given T = t. If $g_1(t_1, \theta \mid t)$ does not depend on $\theta$ for all possible 
values of t then there is no information about $\theta$ in the conditional distribution 
of $T_1$ given T = t for each t. If this holds for every $T_1$, then T is sufficient. 
Note that (2.1.1) holds for vector valued $\theta = (\theta_1, \ldots, \theta_m)'$ and vector valued 
statistics $T_1$, T of dimensions say $k_1$ and k respectively. 


Remark 2.1.2 The observations themselves are always sufficient. For, 
consider a random sample of size n, $(X_1, \ldots, X_n)'$ from $\{f(x, \theta), \theta \in \Omega\}$. 
Then consider T to be the n dimensional statistic $(X_1, \ldots, X_n)'$. Then given 
T = t, i.e. $X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n$, for any real or vector valued statistic 
$T_1$, $T_1$ takes only one value $T_1(x_1, \ldots, x_n) = t_1$ and the conditional distribution 
of $T_1$ given $X_1 = x_1, \ldots, X_n = x_n$ is a singular distribution with entire 
probability mass concentrated at the point $t_1$. Since this conditional distribution 
does not depend on $\theta$ for any $T_1$ and every $(x_1, \ldots, x_n)$, it follows that 
$(X_1, \ldots, X_n)$ is a sufficient statistic. 


Remark 2.1.3 Taking T and $T_1 = (X_1, \ldots, X_{n-k})$ in (2.1.1) we have 

$$g(t, x_1, \ldots, x_{n-k}, \theta) = g_0(t, \theta)\, g_1(x_1, \ldots, x_{n-k} \mid t, \theta) \tag{2.1.2}$$

Now consider the transformation $y_i = x_i$, $i = 1, 2, \ldots, n - k$ and $y_{n-k+1} = 
t_1, \ldots, y_n = t_k$ such that $J = \frac{\partial(y_1, \ldots, y_n)}{\partial(x_1, \ldots, x_n)} \neq 0$. Then the joint pdf of $(Y_1, \ldots, Y_n)$ 
is given by $g(y_1, \ldots, y_n, \theta) = L(h_1(y), \ldots, h_n(y), \theta)\, |J^{-1}|$ where J is the 
Jacobian of the transformation and $x_i = h_i(y)$, $i = 1, 2, \ldots, n$ is the inverse 
transformation. Referring back to the $(X_1, \ldots, X_n)$ co-ordinates and noting 
that when T is sufficient, $g_1$ in (2.1.2) does not depend on $\theta$, we get 

$$L(x_1, \ldots, x_n, \theta) = g_0(T_1(x), \ldots, T_k(x), \theta)\, h(x) \tag{2.1.3}$$

That is, the joint pdf of $(X_1, \ldots, X_n)$ factorizes in two parts, one which is 
the pdf of $(T_1, \ldots, T_k)$ and the other which is purely a function of x. 

In the above discussion the concept of information in a r.v. about a 
parameter $\theta$ has not been quantified. We consider this quantification and 
define Fisher Information in the next section. 


2.2 Fisher Information 

Let X be a real or vector valued r.v. whose pdf depends on a real parameter 
$\theta \in \Omega$. Then as indicated in Chapter 1 we assume that the variations in the pdf 
of X for given X = x as $\theta$ varies over $\Omega$ provide us some information about 
$\theta$. For example let X be binomial with $P(X = x) = f(x, \theta) = \binom{n}{x} \theta^x (1 - \theta)^{n-x}$ 
and let n = 10 and x = 2; then $f(x, \theta)$ varies with $\theta$ as follows: 

$\theta$          .1     .2     .3     .4     .5     .6     .7     .8     .9 

$f(2, \theta)$    .1937  .3020  .2335  .1209  .0439  .0106  .0014  .0001  .0000 

It is this variation which provides some information about $\theta$. 
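
The tabulated values can be reproduced directly from the binomial pmf. The following is a minimal sketch; the function name and the rounding to four decimals are choices made here for illustration.

```python
from math import comb

def binomial_pmf(x, n, theta):
    """f(x, theta) = C(n, x) * theta**x * (1 - theta)**(n - x)."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

n, x = 10, 2
thetas = [k / 10 for k in range(1, 10)]
values = [round(binomial_pmf(x, n, theta), 4) for theta in thetas]

for theta, value in zip(thetas, values):
    print(f"theta = {theta:.1f}   f(2, theta) = {value:.4f}")
```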

Now, as in the study of motion where change in position is studied 
by velocity and acceleration (the derivatives of position w.r.t. time t), the 
information is studied through the derivatives of $f(x, \theta)$ w.r.t. $\theta$ for fixed 
x. Let $S_\theta = \{x \mid f(x, \theta) > 0\}$ denote the support of the pdf $f(x, \theta)$ under $\theta$. 
Let $S = \bigcup_{\theta \in \Omega} S_\theta$. Then we can ignore the points $x \notin S$, as such points will 
have zero probability under any $\theta \in \Omega$. For example in the Binomial distribution 
with known n, $S_\theta = \{0, 1, 2, \ldots, n\}$ and we need not consider $x \notin S_\theta$. It 
may be pointed out here that although f(2, .9) is given to be .0000, this is 
an approximation to $\binom{10}{2} (.9)^2 (.1)^8$ up to four places of decimals. 

We now assume that (i) $S_\theta$ does not depend on $\theta$, i.e. $S_\theta = S$, $\forall\, \theta \in \Omega$, 
and (ii) the pdf is such that differentiation under the integral sign in the identity 
$\int f(x, \theta)\, dx = 1$, $\forall\, \theta \in \Omega$, is valid at least twice. We then can obtain 

$$\int \frac{\partial \log f(x, \theta)}{\partial \theta}\, f(x, \theta)\, dx = 0 \tag{2.2.1}$$

and 

$$\int \frac{\partial^2 \log f(x, \theta)}{\partial \theta^2}\, f(x, \theta)\, dx + \int \left[\frac{\partial \log f(x, \theta)}{\partial \theta}\right]^2 f(x, \theta)\, dx = 0 \tag{2.2.2}$$

Now $\frac{\partial \log f(x, \theta)}{\partial \theta}$ is the rate of change of the log-likelihood of X = x at 
$\theta$, say 'velocity', and similarly $\frac{\partial^2 \log f(x, \theta)}{\partial \theta^2}$ represents the "acceleration". 



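The function $\frac{\partial \log f(x, \theta)}{\partial \theta}$, viewed as a function of x for fixed $\theta$, is called 
a "score function" and for each fixed $\theta$ is a random variable. Then 
(2.2.1) says that 

$$E_\theta\left[\frac{\partial \log f(x, \theta)}{\partial \theta}\right] = 0 \tag{2.2.3}$$

and (2.2.2) gives 

$$E_\theta\left[-\frac{\partial^2 \log f(x, \theta)}{\partial \theta^2}\right] = E_\theta\left[\left(\frac{\partial \log f(x, \theta)}{\partial \theta}\right)^2\right] = \mathrm{Var}_\theta\left[\frac{\partial \log f(x, \theta)}{\partial \theta}\right] \tag{2.2.4}$$

We define Fisher Information about $\theta$ in a r.v. X by 

$$I_X(\theta) = E_\theta\left[-\frac{\partial^2 \log f(x, \theta)}{\partial \theta^2}\right] = E_\theta\left[\left(\frac{\partial \log f(x, \theta)}{\partial \theta}\right)^2\right]$$

Observe that $I_X(\theta) \geq 0$ and $I_X(\theta) = 0$ if and only if 
$E_\theta\left[\left(\frac{\partial \log f(x, \theta)}{\partial \theta}\right)^2\right] = 0$, or $\frac{\partial \log f(x, \theta)}{\partial \theta} = 0$ with probability one, or the 
pdf of X, $f(x, \theta)$, does not depend on $\theta$, or the distribution of X does not 
change with $\theta$. 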
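
The identities (2.2.3) and (2.2.4) can be checked by exact summation for a discrete model. The sketch below uses the binomial pmf of the earlier illustration with n = 10; the value $\theta = 0.3$ and the function names are assumptions made here, and the closed form $n/(\theta(1 - \theta))$ used for comparison is the standard binomial result rather than something derived in this section.

```python
from math import comb

def binomial_pmf(x, n, theta):
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

def score(x, n, theta):
    # d/d(theta) of log f(x, theta) for the binomial pmf
    return x / theta - (n - x) / (1 - theta)

def second_derivative(x, n, theta):
    # d^2/d(theta)^2 of log f(x, theta)
    return -x / theta**2 - (n - x) / (1 - theta)**2

def expectations(n, theta):
    """Return E[score], E[score^2] and E[-second derivative] under theta."""
    pmf = [binomial_pmf(x, n, theta) for x in range(n + 1)]
    mean_score = sum(score(x, n, theta) * p for x, p in enumerate(pmf))
    mean_sq = sum(score(x, n, theta)**2 * p for x, p in enumerate(pmf))
    mean_neg_second = sum(-second_derivative(x, n, theta) * p
                          for x, p in enumerate(pmf))
    return mean_score, mean_sq, mean_neg_second

mean_score, info_from_score, info_from_curvature = expectations(10, 0.3)
print(mean_score, info_from_score, info_from_curvature, 10 / (0.3 * 0.7))
```

Up to rounding error the expected score is zero and the two expressions for the information agree.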

If $(X_1, \ldots, X_n)$ is a random sample of size n with joint pdf $L(x, \theta) = 
\prod_{i=1}^{n} f(x_i, \theta)$, then by taking logs, differentiating twice w.r.t. $\theta$ and taking 
expectations with negative sign we have the Fisher information in the sample 

$$I_{(X_1, X_2, \ldots, X_n)}(\theta) = n\, I_X(\theta) \tag{2.2.4}$$

Let $T(X_1, \ldots, X_n)$ be a real or vector valued statistic with corresponding pdf 
$\{g(t, \theta), \theta \in \Omega\}$; then analogously one can define the Fisher information 
in a statistic T as 

$$I_T(\theta) = E_\theta\left[-\frac{\partial^2 \log g(t, \theta)}{\partial \theta^2}\right] = E_\theta\left[\left(\frac{\partial \log g(t, \theta)}{\partial \theta}\right)^2\right] \tag{2.2.5}$$

We must however assume that the range of T, say $\mathcal{T} = \{t \mid g(t, \theta) > 0\}$, does 
not depend on $\theta$ and that differentiation under the integral sign in the identity 
$\int g(t, \theta)\, dt = 1$, $\forall\, \theta \in \Omega$, is valid at least twice. 



Next we observe that if $(X_1, X_2, \ldots, X_n)'$ is a random sample of size n 
and if $(Y_1, Y_2, \ldots, Y_n)$ is any one-to-one transformation with $\frac{\partial(y_1, \ldots, y_n)}{\partial(x_1, \ldots, x_n)} 
= J \neq 0$ then 

$$g(y_1, \ldots, y_n, \theta) = L(x_1, \ldots, x_n, \theta)\, |J^{-1}|$$

where $x_i = h_i(y_1, \ldots, y_n)$, $i = 1, 2, \ldots, n$ is the unique inverse transformation. 
Noting that $|J^{-1}|$ does not depend on $\theta$, taking logs and differentiating 
twice w.r.t. $\theta$ and then taking expectations with negative sign, it follows 
that 

$$I_{(Y_1, \ldots, Y_n)}(\theta) = I_{(X_1, \ldots, X_n)}(\theta) = n\, I_X(\theta)$$

Next suppose $T = (T_1, \ldots, T_k)'$ is a statistic; then by (2.1.2) 

$$L(x, \theta) = g_0(t, \theta)\, h(x_1, \ldots, x_{n-k}, \theta \mid t)$$

Hence $\log L(x, \theta) = \log g_0(t, \theta) + \log h(x_1, \ldots, x_{n-k}, \theta \mid t)$. Taking derivatives 
twice and taking expectations we have 

$$I_{(X_1, \ldots, X_n)}(\theta) = I_T(\theta) + E\{I_{(X_1, \ldots, X_{n-k} \mid T = t)}(\theta)\} \tag{2.2.6}$$

where $I_{(X_1, X_2, \ldots, X_{n-k} \mid T = t)}(\theta)$ is the (conditional) Fisher information in $X_1, \ldots, 
X_{n-k}$ given T = t. As observed earlier we have $I_{(X_1, X_2, \ldots, X_{n-k} \mid T = t)}(\theta) \geq 0$ with 
probability one and therefore the second term on the RHS of (2.2.6) is non-negative. 
Noting that $I_{(X_1, \ldots, X_n)}(\theta) = n I_X(\theta)$ we have for any k-dimensional 
statistic $(T_1, \ldots, T_k)$, with k < n, 

$$I_{(T_1, \ldots, T_k)}(\theta) \leq n\, I_X(\theta) \tag{2.2.7}$$

This shows that the information in a statistic T is always less than or at most 
equal to that in the original sample. Since constructing a statistic is equivalent 
to a reduction of data, any reduction of data would generally involve a loss 
of information. There is no loss of information if the second term on the RHS 
of (2.2.6) is zero, which is possible only when $I_{(X_1, \ldots, X_{n-k} \mid T = t)}(\theta) = 0$ with 
probability one, or $g_1(x_1, \ldots, x_{n-k}, \theta \mid t)$ for each t does not depend on $\theta$. But 
as seen in Section 2.1 this implies that T is a sufficient statistic. 

Therefore we have now a quantification of information about a parameter 
$\theta$ and, under certain regularity conditions, a result that the information in 
any statistic T is always less than or at most equal to that in the sample. 
There is no loss of information if and only if $T = (T_1, \ldots, T_k)$ is a sufficient 
statistic, in the sense that the conditional distribution of $(X_1, \ldots, X_{n-k})$ given 
T = t does not depend on $\theta$ for each t and therefore the conditional distribution 
of any other statistic $T_1$ given T = t does not depend on $\theta$. To fix the ideas 
we consider an example. 




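Example 2.2.1 Let $(X_1, X_2)$ be i.i.d. $N(\theta, 1)$; then a straightforward calculation 
gives $I_{(X_1, X_2)}(\theta) = 2$. Consider $T_1 = (lX_1 + mX_2) \sim N((l + m)\theta, l^2 + m^2)$; then 

$$g(t_1, \theta) = \frac{1}{\sqrt{2\pi(l^2 + m^2)}} \exp\left\{-\frac{(t_1 - (l + m)\theta)^2}{2(l^2 + m^2)}\right\}$$

Taking logs and differentiating twice w.r.t. $\theta$ we have 

$$-\frac{\partial^2 \log g}{\partial \theta^2} = \frac{(l + m)^2}{(l^2 + m^2)} \quad \text{and} \quad I_{T_1}(\theta) = \frac{(l + m)^2}{(l^2 + m^2)}$$

Now $I_{(X_1, X_2)}(\theta) - I_{T_1}(\theta) = 2 - \frac{(l + m)^2}{(l^2 + m^2)} = \frac{(l - m)^2}{(l^2 + m^2)}$ represents the loss of 
information due to using $T_1$ instead of $(X_1, X_2)$. Note that there is no loss 
of information iff l = m, i.e. $T_1 = k(X_1 + X_2)$, which is sufficient. As already 
seen in Example 2.1.1, the conditional distribution of $T = X_1 + X_2$ given $T_1 = t_1$ is normal 
with mean $2\theta + \frac{(l + m)}{(l^2 + m^2)}[t_1 - (l + m)\theta]$ and variance $\frac{(l - m)^2}{(l^2 + m^2)}$. For fixed $t_1$ and 
$l \neq m$, $m \neq 0$, $X_1$ is in one-to-one correspondence with $X_1 + X_2$, so the conditional 
information in $X_1$ given $T_1 = t_1$ equals that in $X_1 + X_2$ given $T_1 = t_1$. Writing t for the value of $X_1 + X_2$, 

$$\log g_1(t, \theta \mid t_1) = -\frac{1}{2} \log\left[2\pi \frac{(l - m)^2}{(l^2 + m^2)}\right] - \frac{\left\{t - 2\theta - \frac{(l + m)}{(l^2 + m^2)}[t_1 - (l + m)\theta]\right\}^2}{2(l - m)^2/(l^2 + m^2)}$$

and 

$$-\frac{\partial^2 \log g_1}{\partial \theta^2} = \frac{(l - m)^2}{(l^2 + m^2)}$$

Thus the identity $I_{(X_1, X_2)}(\theta) = I_{T_1}(\theta) + E[I_{X_1 \mid T_1 = t_1}(\theta)]$ takes the form 
$2 = \frac{(l + m)^2}{(l^2 + m^2)} + \frac{(l - m)^2}{(l^2 + m^2)}$, and $E[I_{X_1 \mid T_1 = t_1}(\theta)] = 0$ iff $\frac{(l - m)^2}{(l^2 + m^2)} = 0$ or l = m. 

Example 2.2.1 (Contd.) Now consider a sample of size n from $N(\theta, 1)$; 
then the information in the sample is $I_{(X_1, \ldots, X_n)}(\theta) = n$. Let $T_1 = \sum_{i=1}^{n} c_i X_i$. 
Then $T_1 \sim N(\theta \sum c_i, \sum c_i^2)$ and one can easily show that $I_{T_1}(\theta) = \frac{(\sum c_i)^2}{\sum c_i^2} \leq n$, 
and equality holds iff $c_1 = c_2 = \ldots = c_n = l$. Note that if $T_2$ is such that 
$\sum c_i = 0$ then $T_2 \sim N(0, \sum c_i^2)$ and $I_{T_2}(\theta) = 0$. 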
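
The formula $I_{T_1}(\theta) = (\sum c_i)^2 / \sum c_i^2$ and the corresponding loss $n - I_{T_1}(\theta)$ are easy to tabulate. The sketch below does so for a few coefficient vectors; the function names and the chosen vectors are illustrative assumptions.

```python
def information_in_linear_statistic(coeffs):
    """Fisher information about theta in T = sum(c_i * X_i), X_i i.i.d. N(theta, 1)."""
    total = sum(coeffs)
    sum_of_squares = sum(c * c for c in coeffs)
    return total**2 / sum_of_squares

def information_loss(coeffs):
    """Information in the sample (n) minus the information in T."""
    return len(coeffs) - information_in_linear_statistic(coeffs)

# Hypothetical coefficient vectors: equal weights, unequal weights, a contrast.
for coeffs in [(1, 1, 1), (1, 2, 3), (1, 2, -3), (1, 3)]:
    print(coeffs, information_in_linear_statistic(coeffs), information_loss(coeffs))
```

Equal coefficients give no loss, a contrast with $\sum c_i = 0$ loses all the information, and for $(l, m) = (1, 3)$ the loss equals $(l - m)^2/(l^2 + m^2)$.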

We note that $I_X(\theta)$ is well defined under certain regularity conditions on 
the family of pdfs $\{f(x, \theta), \theta \in \Omega\}$, namely the range of the r.v. X should not 
depend on $\theta$, or $S_\theta = S$, and further the operations of differentiation w.r.t. $\theta$ 
and integration w.r.t. x can be interchanged at least twice. 

For models such as the Uniform distribution over $(0, \theta)$ or the exponential 
distribution with $f(x, \theta) = \exp\{-(x - \theta)\}$, $x > \theta$, the support depends on 
$\theta$ as $S_\theta = (0, \theta)$ and $(\theta, \infty)$ respectively. Hence Fisher Information is not 
defined for these distributions. Even when $S_\theta$ does not depend on $\theta$, as in the 
$N(\theta, 1)$ case, we still require validity of the interchange of the operations of 
differentiation and integration. Many standard texts on Advanced Calculus 
give sufficient conditions under which such an interchange of operations 
is valid. For validity of the interchange of the operations of differentiation at 
$\theta_0$ and integration w.r.t. x, it is sufficient that 

(a) $\left|\frac{f(x, \theta_0 + h) - f(x, \theta_0)}{h}\right| < G_0(x)$ for all $|h| < \delta$ 
and $\int G_0(x)\, dx$ is finite, and 

(b) $\frac{\partial f}{\partial \theta}$ exists for almost all values of x 
for each $\theta \in (\theta_0 - \delta, \theta_0 + \delta)$. 

If these two conditions are satisfied then 
differentiation under the integral sign is valid once. For the second derivative we 
require 

(a') $\left|\frac{f(x, \theta_0 + 2h) - 2f(x, \theta_0 + h) + f(x, \theta_0)}{h^2}\right| < G_1(x)$ for all $|h| < \delta$, and 

(b') $\frac{\partial^2 f}{\partial \theta^2}$ exists for almost all x for each $\theta \in (\theta_0 - \delta, \theta_0 + \delta)$. 

The phrase 'almost all' in the conditions (b) or (b') may be explained by way 
of an example. Consider the Double exponential distribution, or Laplace distribution, 
with pdf $f(x, \theta) = \frac{1}{2} \exp\{-|x - \theta|\}$, $x \in R_1$, $\theta \in R_1$. Here $S_\theta = R_1$ and 
the range does not depend on $\theta$. However $\frac{\partial f}{\partial \theta}$ does not exist at $\theta = x$ and 
therefore the set of exceptional points at which $\frac{\partial f}{\partial \theta}$ does not exist for 
$\theta \in (\theta_0 - \delta, \theta_0 + \delta)$ is the interval $(\theta_0 - \delta, \theta_0 + \delta)$, which has positive 
probability under any $\theta$. Now we say that a property Q holds for almost all 
x under the pdf $f(x, \theta)$ if the set of points $Q^c$ where the property fails has 
probability zero, i.e. $\int_{Q^c} f(x, \theta)\, dx = 0$ under each $\theta \in \Omega$. Here $Q^c = (\theta_0 - \delta, 
\theta_0 + \delta)$ and $P_{\theta_0}(Q^c) > 0$. Therefore the condition (b) does not hold. We now 
show that for the $N(\theta, 1)$ family, $\theta \in \Omega = R_1$, both conditions hold. 


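Example 2.2.2 Let $X \sim N(\theta, 1)$; then $S_\theta = R_1$ and the range of X is independent 
of $\theta$. Now $f(x, \theta) = \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{(x - \theta)^2}{2}\right\}$, $x \in R_1$, $\theta \in R_1$. Now consider 

$$\Delta_1 = \frac{f(x, \theta_0 + h) - f(x, \theta_0)}{h} = \frac{f(x, \theta_0)}{h}\left[\frac{f(x, \theta_0 + h)}{f(x, \theta_0)} - 1\right] = \frac{f(x, \theta_0)}{h}\left[e^{-h^2/2}\, e^{h(x - \theta_0)} - 1\right]$$

Hence 

$$\left|\frac{f(x, \theta_0 + h) - f(x, \theta_0)}{h}\right| < f(x, \theta_0) \left|\frac{e^{h(x - \theta_0)} - 1}{h}\right|$$

as $e^{-h^2/2} < 1$ for $|h| < \delta$. Now consider the function $\frac{e^{ha} - 1}{h}$ 
for $0 < |h| < \delta$; then, 
expanding $(e^{ha} - 1)$ in powers of (ha) and using $|\sum a_i| \leq \sum |a_i|$, we have, 
for $|h| < \delta$, 

$$\left|\frac{e^{ha} - 1}{h}\right| = \left|\sum_{r=1}^{\infty} \frac{h^{r-1} a^r}{r!}\right| \leq \sum_{r=1}^{\infty} \frac{\delta^{r-1} |a|^r}{r!} = \frac{e^{\delta |a|} - 1}{\delta}$$

Hence $\left|\frac{e^{ha} - 1}{h}\right| \leq \frac{e^{\delta |a|}}{\delta}$ and therefore 

$$|\Delta_1| \leq \frac{1}{\delta} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{(x - \theta_0)^2}{2} + \delta |x - \theta_0|\right\} = G_0(x) \tag{2.2.8}$$

Now $I_1 = \int_{R_1} G_0(x)\, dx$ can be evaluated explicitly by using the transformation 
$w = (x - \theta_0)$. Then 

$$I_1 = \frac{1}{\delta} \int_{R_1} \frac{1}{\sqrt{2\pi}}\, e^{-w^2/2 + \delta |w|}\, dw \tag{2.2.9}$$

We note that the integral in (2.2.9) is the moment generating function (mgf) 
of $|W|$ evaluated at $\delta$, where $W \sim N(0, 1)$, and can be explicitly evaluated 
by completing the square in the exponent. Now observing that the integrand 
is an even function of w we have 

$$I_1 = \frac{2 e^{\delta^2/2}}{\sqrt{2\pi}\, \delta} \int_0^{\infty} \exp\left\{-\frac{(w - \delta)^2}{2}\right\} dw = \frac{2 e^{\delta^2/2}}{\delta}\, [1 - \Phi(-\delta)] = \frac{2 e^{\delta^2/2}}{\delta}\, \Phi(\delta)$$

so that $G_0(x)$ is integrable. 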



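Similarly, 

$$|\Delta_2| = \left|\frac{f(x, \theta_0 + 2h) - 2f(x, \theta_0 + h) + f(x, \theta_0)}{h^2}\right| \leq f(x, \theta_0) \left|\frac{e^{h(x - \theta_0)} - 1}{h}\right|^2 \leq f(x, \theta_0)\, \frac{e^{2\delta |x - \theta_0|}}{\delta^2}$$

Therefore we take $G_1(x) = \frac{1}{\delta^2} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{(x - \theta_0)^2}{2} + 2\delta |x - \theta_0|\right\}$ and 

$$\int_{R_1} G_1(x)\, dx = \frac{1}{\delta^2}\, (\text{mgf of } |W| \text{ at } 2\delta) = \frac{2}{\delta^2}\, e^{4\delta^2/2}\, \Phi(2\delta)$$

and here also $G_1(x)$ is integrable. 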
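
The integrals of the dominating functions in this example rest on the mgf of $|W|$ for $W \sim N(0, 1)$, which equals $2 e^{s^2/2} \Phi(s)$ at the point s. The sketch below compares this closed form with a plain midpoint-rule integration; the integration range, the step count and the value $\delta = 0.5$ are assumptions chosen for illustration.

```python
from math import erf, exp, pi, sqrt

def std_normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def mgf_abs_normal_closed_form(s):
    """E[exp(s * |W|)] for W ~ N(0, 1), i.e. 2 * exp(s^2 / 2) * Phi(s)."""
    return 2.0 * exp(s * s / 2.0) * std_normal_cdf(s)

def mgf_abs_normal_numeric(s, half_width=12.0, steps=200_000):
    """Midpoint-rule approximation of the same expectation on [-half_width, half_width]."""
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps):
        w = -half_width + (k + 0.5) * h
        total += exp(-w * w / 2.0 + s * abs(w))
    return total * h / sqrt(2.0 * pi)

delta = 0.5  # hypothetical half-width of the neighbourhood of theta_0
integral_G0 = mgf_abs_normal_closed_form(delta) / delta
integral_G1 = mgf_abs_normal_closed_form(2 * delta) / delta**2
print(integral_G0, mgf_abs_normal_numeric(delta) / delta)
print(integral_G1, mgf_abs_normal_numeric(2 * delta) / delta**2)
```

Both integrals come out finite, which is all that conditions (a) and (a') require of the dominating functions.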

Exercise 2.2.1 (i) Consider $f(x, \theta) = \theta e^{-\theta x}$, $x > 0$, $\theta > 0$, which is the pdf of the exponential 
distribution with mean $\frac{1}{\theta}$ (or failure rate $\theta$). Show that differentiation under the integral 
sign is valid twice and the Fisher information $I_X(\theta) = \frac{1}{\theta^2}$. 

(ii) Consider $f(x, \theta) = \frac{1}{\pi} \frac{1}{1 + (x - \theta)^2}$, $x \in R_1$. Show that the Fisher information 
$I(\theta) = 1/2$. Here validity of differentiation under the integral sign is more difficult to prove 
than that in the above examples. 

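If the distribution of the r.v. X depends on an m-dimensional parameter 
$\theta = (\theta_1, \theta_2, \ldots, \theta_m)'$ then we define the Fisher information matrix under the 
conditions that 

(a) The range of the r.v. X or the support of the pdf $S_\theta = \{x \mid f(x, \theta) > 0\}$ 
does not depend on $\theta$. 

(b) Interchange of partial differentiation w.r.t. $\theta_r$, $\theta_s$ and integration w.r.t. 
x is valid in the identity $\int f(x, \theta)\, dx = 1$, $\forall\, \theta \in \Omega \subset R_m$. 

Analogous to (2.2.3) we can show that 

$$E_\theta\left[\frac{\partial \log f(x, \theta)}{\partial \theta_r}\right] = 0, \quad r = 1, 2, \ldots, m \tag{2.2.9}$$

and 

$$E_\theta\left[-\frac{\partial^2 \log f(x, \theta)}{\partial \theta_r\, \partial \theta_s}\right] = E_\theta\left[\frac{\partial \log f}{\partial \theta_r} \frac{\partial \log f}{\partial \theta_s}\right] = J_{rs}(\theta)$$

We then define the Fisher information matrix $J_X(\theta)$ as the (m x m) matrix 
$((J_{rs}(\theta)))$. We observe that $J_X(\theta)$ is the variance covariance matrix of the m 
dimensional vector of score functions $\left(\frac{\partial \log f}{\partial \theta_1}, \ldots, \frac{\partial \log f}{\partial \theta_m}\right)'$ and as such 
$J_X(\theta)$ is a non-negative definite (nnd) matrix. 

For a random sample of size n then, analogous to (2.2.4), we have 

$$J_{(X_1, \ldots, X_n)}(\theta) = n\, J_X(\theta).$$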



We will not go into details here but observe that if $T = (T_1, T_2, \ldots, T_k)'$ is 
a statistic with pdf $\{g(t, \theta), \theta \in \Omega \subset R_m\}$ then, analogous to (2.2.6), we 
obtain the matrix equation 

$$J_{(X_1, \ldots, X_n)}(\theta) = J_T(\theta) + E_\theta[J_{(X_1, \ldots, X_{n-k}) \mid T = t}(\theta)]$$

There is loss of information due to reduction of data by constructing a statistic 
T in the sense that the difference between the Fisher information matrix of 
the sample and that of the statistic T is a nnd matrix, and we say that there 
is no loss of information if $J_{(X_1, \ldots, X_n)}(\theta) = J_T(\theta)$ or $E_\theta[J_{(X_1, \ldots, X_{n-k}) \mid T = t}(\theta)]$ is 
a null matrix. This is possible only when $J_{(X_1, \ldots, X_{n-k}) \mid T = t}(\theta) = 0$ with probability 
one, or $g_1(x_1, \ldots, x_{n-k} \mid T = t)$ does not depend on $\theta$, which implies that T is 
a sufficient statistic. 

2.3 Defining Sufficient Statistic 

The above motivation for a sufficient statistic through information is 
conceptually very useful, but checking sufficiency or otherwise of a statistic 
T in the above way requires verification that for any other statistic $T_1$ 

$$g(t_1, t, \theta) = g_0(t, \theta)\, g_1(t_1, \theta \mid t) \tag{2.3.1}$$

where $g_1(t_1, \theta \mid t)$, the conditional distribution of $T_1$ given T = t, is such that 
it does not depend on $\theta$ for each T = t. On the other hand the result (2.1.3) 
states that the joint pdf of $(X_1, \ldots, X_n)'$ factorizes into 

$$L(x, \theta) = g_0(T_1(x), \ldots, T_k(x), \theta)\, h(x) \tag{2.3.2}$$

where $x = (x_1, x_2, \ldots, x_n)'$ and $g_0(T_1(x), \ldots, T_k(x), \theta)$ denotes the pdf of $(T_1, 
\ldots, T_k)'$ under $\theta$. This is a verifiable definition. In fact in (2.3.2), h is the 
conditional distribution of $(X_1, \ldots, X_n)'$ given T = t. Thus we propose the 
following working definition of a sufficient statistic. 

Definition 2.3.1 Let $(X_1, X_2, \ldots, X_n)'$ be a random sample of size n on a 
r.v. X with pdf belonging to the class $\{f(x, \theta), \theta \in \Omega\}$. Then a statistic T 
with corresponding pdf belonging to $\{g_0(t, \theta), \theta \in \Omega\}$ is said to be sufficient 
for $\theta$, or more precisely for the family of pdfs $\{f(x, \theta), \theta \in \Omega\}$ indexed by 
$\theta$, if the following factorization holds 

$$L(x, \theta) = g_0(T(x), \theta)\, h(x) \tag{2.3.3}$$

where $g_0(T(x), \theta)$ is the pdf of T(x) under $\theta$ and h(x) is a function of x only. 

The above factorization must hold for each $\theta \in \Omega$ and almost all x in 
$S_L = \bigcup_{\theta} \{x \mid L(x, \theta) > 0\}$, the support of $L(x, \theta)$. 

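Example 2.3.1 Consider $(X_1, X_2, X_3)$ i.i.d. $N(\theta, 1)$; then 

$$L(x, \theta) = \left(\frac{1}{\sqrt{2\pi}}\right)^3 \exp\left\{-\frac{1}{2} \sum_{i=1}^{3} (x_i - \theta)^2\right\}, \quad x \in R_3,\ \theta \in R_1$$

Here $S_\theta = R_3$ and therefore $S_L = R_3$. 

Consider $T = X_1 + X_2 + X_3$; then $T \sim N(3\theta, 3)$ and 

$$g_0(t, \theta) = \frac{1}{\sqrt{2\pi \cdot 3}} \exp\left\{-\frac{(t - 3\theta)^2}{2 \cdot 3}\right\}, \quad \text{where } t = x_1 + x_2 + x_3$$

$$= \frac{1}{\sqrt{6\pi}} \exp\left\{-\frac{(\sum x_i - 3\theta)^2}{6}\right\}, \quad x \in R_3,\ \theta \in R_1$$

Taking logarithms consider the difference $\log L - \log g_0 = \Delta$; then 

$$\Delta = c - \frac{1}{2} \sum x_i^2 + \theta \sum x_i - \frac{3}{2}\theta^2 + \frac{(\sum x_i)^2}{6} - \theta \sum x_i + \frac{3}{2}\theta^2$$

$$= c - \frac{1}{2} \sum x_i^2 + (\sum x_i)^2/6$$

Thus $\Delta$ does not depend on $\theta$ and we have $L(x, \theta) = g_0(t, \theta)\, h(x)$, and 
therefore T is sufficient. 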

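Next consider $T = (X_1, X_2 + X_3)'$, a two dimensional statistic; then 

$$g_0(t, \theta) = \left[\frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{(x_1 - \theta)^2}{2}\right\}\right] \left[\frac{1}{\sqrt{2\pi \cdot 2}} \exp\left\{-\frac{(x_2 + x_3 - 2\theta)^2}{4}\right\}\right]$$

as $X_1 \sim N(\theta, 1)$ and $X_2 + X_3 \sim N(2\theta, 2)$ are independent. Again taking logs 
and considering $\log L - \log g_0 = \Delta$ we have 

$$\Delta = c - \frac{1}{2} \sum_{i=2}^{3} x_i^2 + \theta \sum_{i=2}^{3} x_i - \theta^2 + \left(\sum_{i=2}^{3} x_i\right)^2/4 - \theta \sum_{i=2}^{3} x_i + \theta^2$$

$$= c - \frac{1}{2} \sum_{i=2}^{3} x_i^2 + \left(\sum_{i=2}^{3} x_i\right)^2/4$$

which is independent of $\theta$ and therefore $(X_1, X_2 + X_3)'$ is also sufficient. In 
a similar way it can be verified that $(X_2, X_1 + X_3)'$ and $(X_3, X_1 + X_2)'$ are 
sufficient for $\theta$. 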


Next consider $\bar{X} = \frac{1}{3} \sum_{i=1}^{3} X_i$. Then $\bar{X} \sim N\left(\theta, \frac{1}{3}\right)$, 

$$g_0(\bar{x}, \theta) = \sqrt{\frac{3}{2\pi}} \exp\left\{-\frac{3(\bar{x} - \theta)^2}{2}\right\}$$

and 

$$L(x, \theta)/g_0(\bar{x}, \theta) = \text{const.} \exp\left\{-\frac{1}{2} \sum_{i=1}^{3} x_i^2 + \frac{3}{2} \bar{x}^2\right\}$$

which does not depend on $\theta$ and therefore $\bar{X}$ is also sufficient. We note that 
$\sum X_i$ and $\bar{X}$ are in one-one correspondence and, as remarked earlier, if one of 
them is sufficient so would be the other. On the other hand if $T_1 = (X_1, X_2 + X_3)'$ 
and $T_2 = (X_2, X_1 + X_3)'$ then, as seen above, both are sufficient but $T_1$ and 
$T_2$ are not in one-one correspondence. Further $T_1$ and $T_2$ are not functions 
of each other. Note that $(X_1 + X_2 + X_3)$ is a function of $T_1$ as well as that 
of $T_2$ but $T_1$ or $T_2$ is not a function of $(X_1 + X_2 + X_3)$. 

Now consider $T = (X_1 + 2X_2 + 3X_3) \sim N(6\theta, 14)$ and 

$$g_0(t, \theta) = \frac{1}{\sqrt{2\pi \cdot 14}} \exp\left\{-\frac{(x_1 + 2x_2 + 3x_3 - 6\theta)^2}{2 \cdot 14}\right\}, \quad x \in R_3,\ \theta \in R_1$$

Again considering the difference $\log L(x, \theta) - \log g_0(t(x), \theta)$ we observe 
that the coefficient of $\theta^2$ is $-\frac{3}{2} + \frac{36}{28} = -\frac{3}{14} \neq 0$; hence the above difference is not 
purely a function of x and therefore $(X_1 + 2X_2 + 3X_3)$ is not sufficient. 
Suppose $T = X_1 + 2X_2 - 3X_3$. Then $T \sim N(0, 14)$ and 

$$g_0(t, \theta) = \frac{1}{\sqrt{2\pi \cdot 14}} \exp\left\{-\frac{(x_1 + 2x_2 - 3x_3)^2}{2 \cdot 14}\right\}, \quad x \in R_3,\ \theta \in R_1$$

which does not depend 
on $\theta$. Therefore obviously $L(x, \theta)/g_0(t(x), \theta)$ cannot be independent of $\theta$ 
and $(X_1 + 2X_2 - 3X_3)$ is not sufficient. 
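
The check used throughout Example 2.3.1, namely whether $\log L(x, \theta) - \log g_0(T(x), \theta)$ is free of $\theta$, can be carried out numerically for any linear statistic $T = \sum c_i X_i$. The sketch below evaluates the difference at a few values of $\theta$ for a fixed sample point; the sample point, the values of $\theta$ and the function names are hypothetical.

```python
from math import log, pi

def log_L(x, theta):
    """Log of the joint pdf of an i.i.d. N(theta, 1) sample x."""
    return -0.5 * len(x) * log(2 * pi) - 0.5 * sum((xi - theta)**2 for xi in x)

def log_normal_pdf(t, mean, var):
    return -0.5 * log(2 * pi * var) - (t - mean)**2 / (2 * var)

def delta(x, theta, coeffs):
    """log L(x, theta) - log g0(T(x), theta) for T = sum(c_i * X_i)."""
    t = sum(c * xi for c, xi in zip(coeffs, x))
    mean_t = theta * sum(coeffs)
    var_t = sum(c * c for c in coeffs)
    return log_L(x, theta) - log_normal_pdf(t, mean_t, var_t)

x = (0.3, -1.2, 2.5)        # hypothetical sample point
thetas = (-1.0, 0.0, 2.0)   # hypothetical parameter values
d_sum = [delta(x, th, (1, 1, 1)) for th in thetas]   # T = X1 + X2 + X3
d_123 = [delta(x, th, (1, 2, 3)) for th in thetas]   # T = X1 + 2*X2 + 3*X3
print(d_sum)
print(d_123)
```

For $T = X_1 + X_2 + X_3$ the three differences coincide, while for $T = X_1 + 2X_2 + 3X_3$ they change with $\theta$, in line with the conclusions of the example.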


Exercise 2.3.1 (1) For $X_1$, $X_2$ i.i.d. $N(\theta, 1)$, using the above definition show that 
$T = lX_1 + mX_2$ is sufficient iff l = m. 

(2) Generalize the above result for a random sample of size 3 from $N(\theta, 1)$. 

(3) Let $f(x, \lambda) = \lambda e^{-\lambda x}$, $x > 0$, $\lambda > 0$. Let $(X_1, \ldots, X_n)$ be a random sample of size n; 
then show that $\sum X_i$ is sufficient. 

(4) For a sample of size n from $N(\theta, \sigma^2)$ where $\sigma^2$ is known show that $\bar{X}$ is 
sufficient. 

(5) For a sample of size n from $N(\theta, \sigma^2)$ where $\theta$ is known show that $T = \sum (X_i - \theta)^2 
\sim \sigma^2 \chi^2_n$ is sufficient for $\sigma^2$ but $S^2 = \sum (X_i - \bar{X})^2 \sim \sigma^2 \chi^2_{n-1}$ is not sufficient. 

(6) For a sample of size n from $N(\theta, \sigma^2)$ where $(\theta, \sigma^2)'$ is unknown show that 
$(\bar{X}, S^2)'$ is sufficient for the vector valued parameter $(\theta, \sigma^2)'$, where $S^2$ is as defined 
in (5) above. 

(7) Let the pmf of X be $P_\theta[X = 0] = \theta$, $P_\theta[X = 1] = 2\theta$ and $P_\theta[X = 2] = 1 - 3\theta$ where 
$0 < \theta < 1/3$. Let $(T_1, T_2, T_3)'$ denote the vector of observed frequencies of 0, 1 and 2 
in a sample of size n. Show that $(T_1, T_2, T_3)'$ is a sufficient statistic. Let $T_4 = T_1 + T_2$; 
then using the result that $T_4 \sim B(n, 3\theta)$ show that $T_4$ is also sufficient. 


Example 2.3.2 Let $(X_1, X_2, \ldots, X_n)'$ be a random sample of size n on a 
continuous real r.v. X with pdf $\{f(x, \theta), \theta \in \Omega\}$. Then we can form the order 
statistic $T = (X_{(1)}, \ldots, X_{(n)})'$ where $X_{(1)} < X_{(2)} < \ldots < X_{(n)}$, as $P_\theta[X_i = X_j] = 0$ 
for any $\theta \in \Omega$, i.e. there are no tied observations since the r.v. X is continuous. 
Now T is an n-dimensional statistic constructed from an n-dimensional r.v. 
$(X_1, \ldots, X_n)'$. However if the support of f is say an interval (a, b) then the range 
of T is $a < X_{(1)} < X_{(2)} < \ldots < X_{(n)} < b$ whereas that of $(X_1, X_2, \ldots, X_n)$ is 
$(a, b) \times (a, b) \times \ldots \times (a, b)$. The order statistic is a reduction of data in the sense 
that if $(x_{(1)} < \ldots < x_{(n)})$ is the observed order statistic, i.e. if T = t, then 
$A_t = T^{-1}\{t\}$ is the set of all permutations of $(x_{(1)}, \ldots, x_{(n)})$ and consists of 
n! points, or equivalently any one of the n! permutations of $(x_{(1)}, \ldots, x_{(n)})$ 
would give rise to the same order statistic $(x_{(1)}, \ldots, x_{(n)})'$. Note that the 
range space of the order statistic is a proper subset of the range space of 
the observations themselves. The joint pdf of the order statistic is 

$$g_0(x_{(1)}, \ldots, x_{(n)}, \theta) = n! \prod_{i=1}^{n} f(x_{(i)}, \theta), \quad a < x_{(1)} < x_{(2)} < \ldots < x_{(n)} < b$$

= 0 o.w. 

Now $L(x, \theta) = \prod_{i=1}^{n} f(x_i, \theta)$, $a < x_i < b$, $i = 1, 2, \ldots, n$. Observe that 
$\prod_{i=1}^{n} f(x_{(i)}, \theta) = \prod_{i=1}^{n} f(x_i, \theta)$, since $(x_{(1)}, \ldots, x_{(n)})$ is a permutation of $(x_1, \ldots, x_n)'$. 

Thus we can write 

$$L(x, \theta) = g_0(x_{(1)}, \ldots, x_{(n)}, \theta)\, h(x) \tag{2.3.2}$$

where $h(x) = \frac{1}{n!}$ for $x \in A_t$ 

= 0 o.w. 

where $A_t$ denotes the set of n! permutations of $(x_{(1)}, \ldots, x_{(n)})$. 

Since h(x) in (2.3.2) does not depend on $\theta$ it follows that the order 
statistic $(X_{(1)}, \ldots, X_{(n)})'$ is sufficient for $\theta$. 

Example 2.3.3 Let $(X_1, \ldots, X_n)$ be i.i.d. geometric r.v. with $f(x, \theta) = P_\theta(X = x) = (1 - \theta)\theta^x$, $x = 0, 1, 2, \ldots$, $0 < \theta < 1$. Here, the support $S_\theta$ is the set of all non-negative integers $\mathbb{N}$ and does not depend on $\theta$. Since the r.v. $X$ is discrete, $P_\theta[X_i = X_j] > 0$ and we cannot define the order statistic of the sample. Consider $T = \sum_{i=1}^{n} X_i$; then $T$ is negative binomial with parameters $n$ and $\theta$ with pmf

$$g_0(t, \theta) = P_\theta[T = t] = \binom{n + t - 1}{t} (1 - \theta)^n \theta^t, \quad t = 0, 1, 2, \ldots$$

Now $L(x, \theta) = (1 - \theta)^n \theta^{\sum x_i}$ and $g_0(t, \theta) = \binom{n + t - 1}{t} (1 - \theta)^n \theta^t = P_\theta(A_t)$.

Therefore, we can write

$$L(x, \theta) = g_0(T(x), \theta)\, h(x)$$

where $h(x) = 1 \big/ \binom{n + t - 1}{t}$ if $\sum x_i = t$ or $x \in A_t$

$= 0$ otherwise.

Hence $T = \sum X_i$ is sufficient for $\theta$. We observe that $P_\theta(A_t) = \binom{n + t - 1}{t} (1 - \theta)^n \theta^t$ and if $A$ is any subset of $S_L = \mathbb{N}^n$, then $P_\theta(A \mid A_t) = P_\theta(A \cap A_t)/g_0(t, \theta) = \sum_{x \in A \cap A_t} h(x)$ and thus $h(x)$ in fact defines the conditional distribution of $(X_1, \ldots, X_n)$ given $T = t$.
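The claim that the conditional distribution given $T = t$ is uniform over $A_t$ can be checked by brute force. The following is a minimal sketch, not part of the original text; the function names and the choice $n = 3$, $t = 4$ are illustrative assumptions.

```python
from itertools import product
from math import comb

def geometric_pmf(x, theta):
    """P[X = x] = (1 - theta) * theta**x for x = 0, 1, 2, ..."""
    return (1 - theta) * theta**x

def conditional_given_sum(n, t, theta):
    """Conditional pmf of (X_1, ..., X_n) given sum(X_i) = t (hypothetical helper)."""
    # A_t: all points of the sample space whose coordinates add up to t
    a_t = [x for x in product(range(t + 1), repeat=n) if sum(x) == t]
    joint = {}
    for x in a_t:
        p = 1.0
        for xi in x:
            p *= geometric_pmf(xi, theta)
        joint[x] = p
    total = sum(joint.values())  # this is g_0(t, theta) = P(A_t)
    return {x: p / total for x, p in joint.items()}

# The conditional probabilities should not change with theta.
for theta in (0.2, 0.7):
    cond = conditional_given_sum(3, 4, theta)
    print(theta, len(cond), min(cond.values()), max(cond.values()), 1 / comb(6, 4))
```

For every value of $\theta$ tried, each point of $A_t$ receives the same conditional probability $1/\binom{n+t-1}{t}$, in line with the factor $h(x)$ above.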

Remark 2.3.1 Consider Example 2.3.2. Here $A_t = \{$all $n!$ permutations of $(x_{(1)}, \ldots, x_{(n)})\}$ and as $X$ is a continuous r.v., $P_\theta(A_t) = 0$. Therefore the conditional probability of a subset $A$ of the sample space $(a, b)^n$ given $A_t$, $P_\theta(A \cap A_t)/P_\theta(A_t)$, is an indeterminate form of the type $\frac{0}{0}$. However, using methods of measure theoretic analysis, we can give a meaning to this indeterminate form in a consistent manner. Heuristically we can interpret the function $h(x) = \frac{1}{n!}$, $x \in A_t$ and zero otherwise, as a discrete probability distribution on $(a, b)^n$ so that $P_\theta(A \mid A_t) = \frac{1}{n!}\{$number of permutations of $(x_{(1)}, \ldots, x_{(n)})$ belonging to $A\}$. In the normal case discussed in Example 2.3.1, for $T = \sum_{i=1}^{3} X_i$ we have $P_\theta(A_t) = 0$ and therefore we have the same difficulty in defining the conditional distribution on $R_3$. Using techniques of multivariate normal distribution one can show that the conditional distribution of $(X_1, X_2, X_3)$ given $T = t$ is a singular multivariate normal distribution on $R_3$ with mean vector $\left(\frac{t}{3}, \frac{t}{3}, \frac{t}{3}\right)'$ and variance covariance matrix given by

$$\frac{1}{3}\begin{pmatrix} 2 & -1 & -1 \\ -1 & 2 & -1 \\ -1 & -1 & 2 \end{pmatrix}$$

which has rank 2, i.e. the distribution is concentrated on a subspace of dimension two given by $x_1 + x_2 + x_3 = t$. In both the cases above the conditional distributions are singular on the original sample space but can be interpreted as proper distributions on a lower dimensional space.

Because of the difficulties involved in defining the conditional distribution of $X$ given $T$ in the continuous case we have defined sufficiency of $T$ using the factorization given by (2.3.1), which we will call the Neyman Factorizability Criterion. It is also possible to define $T$ to be sufficient if the conditional distribution of $X$ given $T$ is independent of $\theta$ and then derive (2.3.1). On the other hand if (2.3.1) holds one can show that the conditional distribution of $X$ given $T$ does not depend on $\theta$ and thus the two definitions are equivalent. We will show this in the discrete case where the conditional distribution of $X$ given $T$ can be defined very easily.

THEOREM 2.3.1 In the discrete case, with sufficiency as given in Definition 2.3.1, the conditional distribution of $X$ given $T$ is independent of $\theta$ if and only if the factorization (2.3.1) holds.





Consider $P_\theta[X = x \mid T = t] = P_\theta[X = x, T = t]/g_0(t, \theta)$ where $g_0(t, \theta) = \sum_{x \in A_t} L(x, \theta) > 0$. Then

$$P_\theta[X = x \mid T = t] = \frac{L(x, \theta)}{g_0(t(x), \theta)} \quad \text{if } x \in A_t$$
$$= 0 \quad \text{o.w.}$$

Now suppose (2.3.1) holds; then it follows that

$$P_\theta[X = x \mid T = t] = h(x) \quad \text{if } x \in A_t$$
$$= 0 \quad \text{o.w.}$$

which is independent of $\theta$ for each $x$ and any $t$, and therefore the conditional distribution of $X$ given $T$ does not depend on $\theta$. On the other hand suppose that the conditional distribution of $X$ given $T$ does not depend on $\theta$. Consider $x \in S_L$; then as $\{A_t\} = \{T^{-1}(t)\}$ is a partition of $S_L$ there exists a unique $t$ depending on $x$ but not on $\theta$ such that $x \in A_t$, and then for each $\theta \in \Omega$

$$P_\theta[X = x] = L(x, \theta) = P_\theta[X = x \mid T = t]\, P_\theta[T = t]$$
$$= W(x, t(x))\, g_0(t(x), \theta)$$

since $P_\theta[X = x \mid T = t]$ does not depend on $\theta$. Taking $W(x, t(x)) = h(x)$ we have the required factorization, namely $L(x, \theta) = g_0(t(x), \theta) h(x)$.

This theorem can be used to show that a statistic T is not sufficient 
without explicitly obtaining the distribution of T as the following example 
shows. 

Example 2.3.4 Let $(X_1, X_2)$ be i.i.d. Poisson with parameter $\lambda$ and let $T = (X_1 + 2X_2)$. Consider $T = 0$; then $A_0 = \{(0, 0)\}$ and $P_\lambda[X = x \mid T = 0] = 1$ if $x = (0, 0)$ and zero otherwise, which is independent of $\lambda$. However consider $T = 2$; then $A_2 = \{(2, 0), (0, 1)\}$ and

$$P[(X_1 = 2, X_2 = 0) \mid T = 2] = \frac{e^{-2\lambda}\lambda^2/2}{e^{-2\lambda}\lambda^2/2 + e^{-2\lambda}\lambda} = \lambda/(2 + \lambda)$$

which depends on $\lambda$. Therefore $(X_1 + 2X_2)$ cannot be sufficient for $\lambda$.
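The calculation in this example is easy to reproduce numerically. The sketch below is an illustration only (the helper names and the `weights` argument are assumptions introduced here): it enumerates the points of the sample space with $X_1 + 2X_2 = t$ and computes the conditional pmf for several values of $\lambda$, and for comparison does the same for $X_1 + X_2$.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P[X = k] for a Poisson r.v. with mean lam."""
    return exp(-lam) * lam**k / factorial(k)

def conditional_given_t(t, lam, weights=(1, 2)):
    """Conditional pmf of (X1, X2) given w1*X1 + w2*X2 = t, X1, X2 i.i.d. Poisson(lam).

    Assumes positive integer weights, so each coordinate is at most t.
    """
    w1, w2 = weights
    # A_t: all non-negative integer points with w1*x1 + w2*x2 = t
    a_t = [(x1, x2) for x1 in range(t + 1) for x2 in range(t + 1)
           if w1 * x1 + w2 * x2 == t]
    joint = {x: poisson_pmf(x[0], lam) * poisson_pmf(x[1], lam) for x in a_t}
    total = sum(joint.values())
    return {x: p / total for x, p in joint.items()}

for lam in (0.5, 1.0, 3.0):
    cond = conditional_given_t(2, lam)
    # compare with the closed form lam / (2 + lam)
    print(lam, cond[(2, 0)], lam / (2 + lam))
    # for X1 + X2 the conditional pmf is free of lam
    print(lam, conditional_given_t(2, lam, weights=(1, 1)))
```

The conditional probability of $(2, 0)$ changes with $\lambda$ for $X_1 + 2X_2$, whereas for $X_1 + X_2$ the conditional pmf is the same for every $\lambda$.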

Example 2.3.5 Let $(X_1, X_2)$ be i.i.d. $b(1, \theta)$; then it is easy to show that $(X_1 + X_2)$ is sufficient for $\theta$ by using the Neyman factorizability criterion. Consider $T_1 = X_1 + 2X_2$. We can show that $T_1$ is sufficient by arguing that $(x_1, x_2)$ and $(x_1 + 2x_2)$ are in 1:1 correspondence since $(0, 0) \leftrightarrow 0$, $(0, 1) \leftrightarrow 2$, $(1, 0) \leftrightarrow 1$ and $(1, 1) \leftrightarrow 3$. Note that these points constitute the sample space of $(X_1, X_2)$. There are many points at which $(x_1, x_2)$ and $(x_1 + 2x_2)$ are not in one-one correspondence but the set of these points has probability zero. Further $T_2 = X_1 + 3X_2$ is also in 1:1 correspondence with $(X_1, X_2)$ and with $T_1$ on the sample space of $(X_1, X_2)$. One can generalize this result for $T = aX_1 + bX_2$ where $a \neq b$.


Exercise 2.3.2 (1) Let $(X_1, X_2, X_3)$ be i.i.d. $b(1, \theta)$ and let $T = X_1 + 2X_2 + 3X_3$. Show that $(X_1, X_2, X_3)$ and $T$ are not in 1:1 correspondence as $T^{-1}(3) = \{(1, 1, 0), (0, 0, 1)\}$. Obtain the conditional distribution on the sample space given $T = 3$. Does this depend on $\theta$? Show that except for the above two points $T$ and $(X_1, X_2, X_3)$ are in 1:1 correspondence.

(2) Let $X_1, X_2$ be i.i.d. with pmf given by $f(0, \theta) = e^{-\theta}$, $f(1, \theta) = \theta e^{-\theta}$, $f(2, \theta) = 1 - e^{-\theta} - \theta e^{-\theta}$. This pmf can be obtained from the Poisson distribution with mean $\theta$ assuming that all observations greater than or equal to 2 are recorded as 2. Thus if $X$ is Poisson and we define $Z = \mathrm{Min}(X, 2)$ then the pmf of $Z$ is the pmf given above. Such an operation of recording observations is known as right censoring. We have seen that in the case of non-censored observations $(X_1 + X_2)$ is sufficient for $\theta$. Show that in the above censored case, $P[X_1 = 0 \mid X_1 + X_2 = 2]$ depends on $\theta$ and $(X_1 + X_2)$ is not sufficient.


2.4 Obtaining Sufficient Statistic 

The Neyman Factorizability criterion given in the previous section enables us to check whether a given statistic, real or vector valued, is sufficient or not. However, it does not provide us any method to obtain the sufficient statistic which will allow maximum possible reduction of data. For example for a random sample of size $n$ from $N(\theta, 1)$ we have that $(X_1, \ldots, X_n)'$, the observations themselves, as well as the order statistic $(X_{(1)}, \ldots, X_{(n)})'$ and $\sum X_i$ are all sufficient for $\theta$. Note that $\sum X_i$ is a further reduction of the order statistic as it can be expressed as a function of $(X_{(1)}, \ldots, X_{(n)})'$ since $\sum X_i = \sum X_{(i)}$. A question naturally arises as to whether any further reduction of $\sum X_i$ is possible. We will answer both of these questions by providing a construction to obtain a sufficient statistic which represents maximum possible reduction of the data. As before our basic framework remains the same, namely $(X_1, \ldots, X_n)'$ are i.i.d. with pdf (pmf) belonging to the family $\{f(x, \theta), \theta \in \Omega\}$. We allow $x$ and $\theta$ to be both vector valued of dimensions $k$ and $m$ although emphasis initially would be on the case $k = 1$ and $m = 1$.

Let $\mathcal{X}^{(n)}$ be the sample space of $(X_1, \ldots, X_n)'$ and let $T(x)$ be a statistic, a function which maps $\mathcal{X}^{(n)}$ into the range space $\mathcal{T}$. Recall that a function $T$ induces a partition of $\mathcal{X}^{(n)}$ given by $\{A_t, t \in \mathcal{T}\}$, i.e. $A_{t_1} \cap A_{t_2} = \phi$ and $\bigcup_{t \in \mathcal{T}} A_t = \mathcal{X}^{(n)}$ where $A_t = \{x \mid T(x) = t\}$. Conversely any partition $\{B_u, u \in \mathcal{U}\}$ defines a function $U$ which takes different values on the disjoint subsets $B_j$ and $B_k$. Thus a function induces a partition of $\mathcal{X}^{(n)}$ and can also be defined by a partition of $\mathcal{X}^{(n)}$. On the other hand a partition of $\mathcal{X}^{(n)}$ defines and is in turn defined by an equivalence relation on $\mathcal{X}^{(n)}$. Thus a function $T$ from $\mathcal{X}^{(n)}$ to $\mathcal{T}$ can be defined by an equivalence relation. For a function $T$ on $\mathcal{X}^{(n)}$ we say that $x \in \mathcal{X}^{(n)}$ is $T$-equivalent to $y \in \mathcal{X}^{(n)}$, written as $x \overset{T}{\sim} y$, iff $T(x) = T(y)$, and this defines the partition $\{A_t, t \in \mathcal{T}\}$. On the other hand if $\{B_u, u \in \mathcal{U}\}$ is a partition of $\mathcal{X}^{(n)}$ then we can define a function on $\mathcal{X}^{(n)}$ such that on $B_u$ it takes the value $u$, or $U(x) = u$ iff $x \in B_u$.




Now consider the joint pdf of $(X_1, \ldots, X_n)'$ given by $L(x, \theta) = \prod_{i=1}^{n} f(x_i, \theta)$ and let $S_L$ be the support of $L(x, \theta)$, as defined in the previous section. We now define a partition of $\mathcal{X}^{(n)}$ using $L(x, \theta)$ as follows. First notice that $\{S_L, S_L^c\}$ is a partition of $\mathcal{X}^{(n)}$. Now we define a further partition of $S_L$. Let $x$ and $y \in S_L$. Then we say that $x$ is likelihood equivalent to $y$, denoted by $x \sim y$, if there exists a positive constant $k_1$, independent of $\theta$ (but possibly depending on $x$, $y$) such that $L(x, \theta) = k_1 L(y, \theta)$ for all $\theta \in \Omega$, i.e. considered as functions of $\theta$ for fixed $x$ and $y$, $L(x, \theta)$ and $L(y, \theta)$ are proportional to each other. We observe that $\sim$ is an equivalence relation on $S_L$. For $x \sim x$ as we can take $k_1 = 1$. If $x \sim y$ then $y \sim x$ as we have $L(y, \theta) = \frac{1}{k_1} L(x, \theta)$. If $x \sim y$ and $y \sim z$ then $L(x, \theta) = k_1 k_2 L(z, \theta)$ and $x \sim z$. Thus $\sim$ defines an equivalence relation and a partition of $S_L$, and together with $S_L^c$ defines a partition of $\mathcal{X}^{(n)}$. We point out the fact that if $T_1$ and $T_2$ are such that they are in one-one correspondence then they define the same partition of $\mathcal{X}^{(n)}$.
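As an illustration of how likelihood equivalence partitions a sample space, the sketch below groups the eight points of $\{0, 1\}^3$ for a Bernoulli sample. It is a minimal example under stated assumptions: the function names are hypothetical, and proportionality is only tested on a small grid of $\theta$ values, which is a necessary condition in general and happens to be enough for this family. Exact rational arithmetic is used so that equality of ratios is not blurred by rounding.

```python
from fractions import Fraction
from itertools import product

def bernoulli_likelihood(x, theta):
    """L(x, theta) for an i.i.d. b(1, theta) sample x."""
    s = sum(x)
    return theta**s * (1 - theta)**(len(x) - s)

def likelihood_equivalent(x, y, thetas):
    """True if L(x, theta) / L(y, theta) is the same constant on the grid."""
    ratios = {bernoulli_likelihood(x, th) / bernoulli_likelihood(y, th)
              for th in thetas}
    return len(ratios) == 1

def partition(points, thetas):
    """Group points into likelihood-equivalence classes (on the grid)."""
    classes = []
    for x in points:
        for cls in classes:
            if likelihood_equivalent(x, cls[0], thetas):
                cls.append(x)
                break
        else:
            classes.append([x])
    return classes

thetas = [Fraction(1, 4), Fraction(1, 2), Fraction(2, 3)]
for cls in partition(list(product((0, 1), repeat=3)), thetas):
    print(sum(cls[0]), cls)
```

Each class printed consists of the points sharing a common value of $\sum x_i$, so in this family the partition coincides with the one induced by the statistic $\sum X_i$.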

Example 2.4.1 Let $(X_1, X_2, \ldots, X_n)$ be i.i.d. with pdf given by

$$f(x, \theta) = \frac{\theta}{x^{\theta + 1}}, \quad x > 1,\ \theta > 0.$$

This is the well known Pareto distribution used as a model for income distribution. Here

$$L(x, \theta) = \frac{\theta^n}{(\prod x_i)^{\theta + 1}}, \quad x_i > 1,\ i = 1, 2, \ldots, n,\ \theta > 0,$$

$S_f^n = (1, \infty)^n = S_L$ and $S_L^c = \phi$.

Now let $x, y \in S_L$; then $x \sim y$ iff $L(x, \theta)/L(y, \theta)$ is independent of $\theta$. Consider

$$\frac{L(x, \theta)}{L(y, \theta)} = \frac{(\prod y_i)^{\theta + 1}}{(\prod x_i)^{\theta + 1}},$$

which is independent of $\theta$ for every $\theta > 0$ iff $\prod x_i = \prod y_i$. Thus $\sim$ defines the function $T_1(x) = \prod x_i$ on $\mathcal{X}^{(n)}$. Consider $T_2(x) = \sum \log x_i = \log T_1$. Then $T_1$ and $T_2$ are in 1:1 correspondence as $T_1(x) = \exp(T_2(x))$. Note that the ranges of $T_1$ and $T_2$ are different but they induce the same partition of $\mathcal{X}^{(n)}$ in the sense that $T_1^{-1}(a) = T_2^{-1}(\log a)$ and $T_2^{-1}(b) = T_1^{-1}(e^b)$. Is $T_1$ or $T_2$ sufficient? To use the Neyman Factorizability criterion we must obtain the pdf, and it is easier to work with $T_2$. Noting that $\log X$ is exponential with mean $1/\theta$, we have $T_2 \sim G\left(n, \frac{1}{\theta}\right)$ or

$$g_0(t_2, \theta) = \frac{\theta^n}{\Gamma(n)}\, e^{-\theta t_2}\, t_2^{\,n-1}, \quad t_2 > 0,\ \theta > 0.$$

Hence

$$\frac{L(x, \theta)}{g_0(t_2(x), \theta)} = \frac{\Gamma(n)}{(\prod x_i)(\sum \log x_i)^{n-1}}, \quad x_i > 1,$$

which is independent of $\theta$ and therefore $T_2$ is sufficient. Since $T_1$, $T_2$ are in one-one correspondence $T_1$ is also sufficient.

Recall that the pdf of $T_1$ can be obtained from that of $T_2$ by using the rule of transformation given by

$$g(t_1, \theta) = g_0(t_2(t_1), \theta) \left|\frac{dt_2}{dt_1}\right|$$
$$= g_0(\log t_1, \theta)\, \frac{1}{t_1} \quad \text{as } t_2 = \log t_1$$
$$= \frac{\theta^n}{\Gamma(n)}\, e^{-\theta \log t_1} (\log t_1)^{n-1}\, \frac{1}{t_1}$$
$$= \frac{\theta^n (\log t_1)^{n-1}}{\Gamma(n)\, t_1^{\theta + 1}}, \quad t_1 > 1.$$

Hence $L(x, \theta)/g(t_1(x), \theta) = \dfrac{\Gamma(n)}{(\log \prod x_i)^{n-1}}$, showing that $T_1 = \prod X_i$ is also sufficient.
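The factorization for the Pareto sample can be verified numerically. The following sketch is illustrative only: the function names and the sample values are assumptions made here, and the gamma density used for $T_2 = \sum \log X_i$ is the one with density $\theta^n e^{-\theta t} t^{n-1}/\Gamma(n)$. It evaluates $L(x, \theta)/g_0(t_2(x), \theta)$ at several values of $\theta$ and compares it with the closed form $\Gamma(n)/\{(\prod x_i)(\sum \log x_i)^{n-1}\}$.

```python
from math import exp, gamma, log, prod

def pareto_likelihood(x, theta):
    """L(x, theta) = theta^n / (prod x_i)^(theta + 1), all x_i > 1."""
    return theta**len(x) / prod(x)**(theta + 1)

def g0_t2(t2, theta, n):
    """Density of T2 = sum(log X_i): theta^n / Gamma(n) * exp(-theta*t2) * t2^(n-1)."""
    return theta**n / gamma(n) * exp(-theta * t2) * t2**(n - 1)

def h(x):
    """Candidate theta-free factor Gamma(n) / (prod x_i * (sum log x_i)^(n-1))."""
    n = len(x)
    return gamma(n) / (prod(x) * sum(log(xi) for xi in x)**(n - 1))

x = (1.5, 2.0, 4.0)  # arbitrary sample with every x_i > 1
t2 = sum(log(xi) for xi in x)
for theta in (0.5, 1.0, 2.5):
    print(theta, pareto_likelihood(x, theta) / g0_t2(t2, theta, len(x)), h(x))
```

The ratio printed in the second column stays equal to $h(x)$ as $\theta$ varies, which is the content of the factorization.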

Example 2.4.2 Let $(X_1, \ldots, X_n)$ be i.i.d. with pdf given by $f(x, \theta) = \frac{1}{\theta}$, $0 < x < \theta$. This is the well known uniform distribution over the interval $(0, \theta)$. Here the range of the r.v. $X$ depends on the parameter $\theta$ and $S_\theta = (0, \theta)$, and as $\Omega = (0, \infty)$, $S_L = (0, \infty)^n$ and $L(x, \theta) = \frac{1}{\theta^n}$, $0 < x_i < \theta$, $i = 1, 2, \ldots, n$.

Let $x \in S_L$ and $\theta \in \Omega$. Since each $x_i \leq x_{(n)} = \max(x_1, \ldots, x_n)$ we can write

$$L(x, \theta) = \frac{1}{\theta^n} \left\{\prod_{i=1}^{n} \psi(x_i, x_{(n)})\right\} \psi(x_{(n)}, \theta)$$

where $\psi(a, b) = 1$ if $a \leq b$ and zero otherwise.

Let $x \in S_L$ and $y \in S_L$ and let $y_{(n)} = \max(y_1, \ldots, y_n)$. Then

$$L(y, \theta) = \frac{1}{\theta^n} \left\{\prod_{i=1}^{n} \psi(y_i, y_{(n)})\right\} \psi(y_{(n)}, \theta).$$

Comparing $L(x, \theta)$ and $L(y, \theta)$ we observe that $x \sim y$ iff $\psi(x_{(n)}, \theta) = \psi(y_{(n)}, \theta)$ for all $\theta \in (0, \infty)$. Suppose that $x_{(n)} \neq y_{(n)}$; then either we must have $x_{(n)} < y_{(n)}$ or $x_{(n)} > y_{(n)}$. Consider the case $x_{(n)} < y_{(n)}$; then for $\theta \in (x_{(n)}, y_{(n)})$, $\psi(x_{(n)}, \theta) = 1$ but $\psi(y_{(n)}, \theta) = 0$ and $x$ is not likelihood equivalent to $y$. Similarly for the case $y_{(n)} < x_{(n)}$ we have for $\theta \in (y_{(n)}, x_{(n)})$, $\psi(y_{(n)}, \theta) = 1$ but $\psi(x_{(n)}, \theta) = 0$ and $x$ is not likelihood equivalent to $y$. Thus $x \sim y$ iff $x_{(n)} = y_{(n)}$, or the likelihood equivalence defines the statistic $T = X_{(n)} = \mathrm{Max}(X_1, \ldots, X_n)$.

To verify that $X_{(n)}$ is sufficient we note that $g_0(x_{(n)}, \theta) = \dfrac{n x_{(n)}^{n-1}}{\theta^n} \psi(x_{(n)}, \theta)$ and therefore we can write $L(x, \theta) = g_0(x_{(n)}, \theta)\, h(x)$ where $h(x) = \prod_{i=1}^{n} \psi(x_i, x_{(n)}) \big/ n x_{(n)}^{n-1}$, which is independent of $\theta$. Hence $X_{(n)}$ is sufficient.







Exercise 2.4.1 (1) Let $(X_1, \ldots, X_n)$ be i.i.d. exponential with location $\theta$ having pdf $f(x, \theta) = \exp\{-(x - \theta)\}$, $x > \theta$. Show that the likelihood equivalence relation defines the statistic $T = X_{(1)} = \mathrm{Min}(X_1, \ldots, X_n)$ and $T$ is sufficient. Note that $g_0(x_{(1)}, \theta) = n \exp\{-n(x_{(1)} - \theta)\}$, $x_{(1)} > \theta$.

(2) Show that for a random sample of size $n$ from the population with pdf $f(x, \theta) = \theta x^{\theta - 1}$, $0 < x < 1$, $\theta > 0$, the likelihood equivalence defines the statistic $T = \sum \log X_i$ which is sufficient. What is the connection between this problem and Example 2.4.1?

(3) Show that for a random sample of size $n$ from $b(1, \theta)$, $x \sim y$ iff $\sum x_i = \sum y_i$ and that $T = \sum X_i$ is sufficient.

(4) Show that the result of Exercise (3) above also holds for the Poisson distribution and the geometric distribution.

Example 2.4.3 We now consider an example in which $\sim$ leads to a two dimensional sufficient statistic for a real parameter $\theta$. Let $(X_1, X_2)$ be i.i.d. Cauchy $(\theta, 1)$ with pdf

$$f(x, \theta) = \frac{1}{\pi}\, \frac{1}{1 + (x - \theta)^2}, \quad x \in R_1,\ \theta \in R_1.$$

Here $S_L = R_2$ and $\Omega = R_1$. Now $x \sim y$ iff for any $\theta \in R_1$

$$[1 + (y_1 - \theta)^2][1 + (y_2 - \theta)^2] = k[1 + (x_1 - \theta)^2][1 + (x_2 - \theta)^2],$$

where $k$ does not depend on $\theta$. Expanding both sides and comparing coefficients of $\theta^4$ we get $k = 1$, and using $k = 1$ and then comparing coefficients of $\theta^3$ we get $y_1 + y_2 = x_1 + x_2$. Using these relations and equating coefficients of $\theta^2$ we get $x_1 x_2 = y_1 y_2$. Comparing coefficients of $\theta$ and the constant term, we have the equations

$$y_1(1 + y_2^2) + y_2(1 + y_1^2) = x_1(1 + x_2^2) + x_2(1 + x_1^2)$$
$$(1 + y_1^2)(1 + y_2^2) = (1 + x_1^2)(1 + x_2^2).$$

The last two equations hold when $x_1 + x_2 = y_1 + y_2$ and $x_1 x_2 = y_1 y_2$ and do not lead to any further restriction on $x$, $y$. Thus $x \sim y$ implies that either $x_1 = y_1$, $x_2 = y_2$ or $x_1 = y_2$, $x_2 = y_1$, i.e. $y$ is a permutation of $x$. Thus the $\sim$ relation defines in this case the order statistic $(X_{(1)}, X_{(2)})$, since the order statistic induces the same partition of $R_2$ as the permutations of $(x_1, x_2)$. We have already seen that $(X_{(1)}, X_{(2)})$ is sufficient for $\theta$.
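A quick numerical experiment makes the Cauchy conclusion concrete. The sketch below is an illustration with assumed names and arbitrarily chosen sample points: it measures how much the ratio $L(x, \theta)/L(y, \theta)$ varies over a grid of $\theta$ values. For a permutation of $x$ the ratio is constant, while for a point with the same sum but a different product it is not.

```python
from math import pi

def cauchy_likelihood(x, theta):
    """Joint pdf of i.i.d. Cauchy(theta, 1) observations at the point x."""
    out = 1.0
    for xi in x:
        out *= 1.0 / (pi * (1.0 + (xi - theta)**2))
    return out

def ratio_spread(x, y, thetas):
    """max - min of L(x, theta) / L(y, theta) over the grid (0 means constant)."""
    ratios = [cauchy_likelihood(x, th) / cauchy_likelihood(y, th) for th in thetas]
    return max(ratios) - min(ratios)

thetas = [-2.0, -1.0, 0.0, 0.5, 2.0, 5.0]
x = (1.0, 3.0)
print(ratio_spread(x, (3.0, 1.0), thetas))  # permutation of x
print(ratio_spread(x, (0.0, 4.0), thetas))  # same sum, different product
```

Matching the sum alone is therefore not enough here; the product must match as well, which is why the reduction stops at the order statistic.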


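Example 2.4.4 In this example we show how $\sim$ defines a two dimensional sufficient statistic for a two dimensional parameter. Let $(X_1, \ldots, X_n)$ be i.i.d. $N(\theta_1, \theta_2)$ with joint pdf

$$L(x, \theta) = \left(\frac{1}{\sqrt{2\pi\theta_2}}\right)^n \exp\left\{-\frac{\sum (x_i - \theta_1)^2}{2\theta_2}\right\}.$$

Consider $x, y \in R_n$ and

$$\log L(x, \theta) - \log L(y, \theta) = -\frac{1}{2\theta_2}\left\{\sum (x_i - \theta_1)^2 - \sum (y_i - \theta_1)^2\right\}$$
$$= -\left(\sum x_i^2 - \sum y_i^2\right)\frac{1}{2\theta_2} + \frac{\theta_1}{\theta_2}\left(\sum x_i - \sum y_i\right).$$

This does not depend on $\theta_1$ or $\theta_2$ iff $\sum x_i^2 = \sum y_i^2$ and $\sum x_i = \sum y_i$. Hence $\sim$ defines a two-dimensional statistic $(T_1, T_2)$ where $T_1 = \sum X_i$ and $T_2 = \sum X_i^2$. Marginally $T_1 \sim N(n\theta_1, n\theta_2)$ and $T_2/\theta_2$ is noncentral $\chi^2$ with $n$ degrees of freedom and non-centrality parameter $\delta = n\theta_1^2/\theta_2$. The joint distribution of $(T_1, T_2)$ is not easy to obtain. Hence we consider a one-one transformation of $(T_1, T_2)$ given by $T_3 = \bar{X} = \frac{T_1}{n}$ and $T_4 = \sum (X_i - \bar{X})^2 = T_2 - T_1^2/n$. Then

$$g_0(t_3, t_4; \theta_1, \theta_2) = \sqrt{\frac{n}{2\pi\theta_2}} \exp\left\{-\frac{n(t_3 - \theta_1)^2}{2\theta_2}\right\} \cdot \frac{e^{-t_4/2\theta_2}\, t_4^{\frac{n-1}{2} - 1}}{\Gamma\left(\frac{n-1}{2}\right)(2\theta_2)^{(n-1)/2}}$$

since $T_3 \sim N\left(\theta_1, \frac{\theta_2}{n}\right)$, $T_4 \sim \theta_2 \chi^2_{n-1}$ and $(T_3, T_4)$ are independent. Therefore $L(x, \theta_1, \theta_2) = g_0(t_3, t_4; \theta_1, \theta_2)\, h(x)$ where

$$h(x) = \frac{\Gamma\left(\frac{n-1}{2}\right) 2^{(n-1)/2}}{n^{1/2} (2\pi)^{(n-1)/2} \left\{\sum (x_i - \bar{x})^2\right\}^{\frac{n-1}{2} - 1}}.$$

This shows that $(T_3, T_4)$ is sufficient for $(\theta_1, \theta_2)$ and as $(T_1, T_2)$ is in one-one correspondence with $(T_3, T_4)$ we have that $(T_1, T_2)$ is also sufficient.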

Now we show that the likelihood equivalence always leads to a sufficient statistic which represents the maximum possible reduction of data. Let $T_1$ and $T_2$ be two sufficient statistics; then $T_2$ represents a further reduction of $T_1$ if $T_2$ can be expressed as a function of $T_1$, i.e. $T_2(x) = \varphi(T_1(x))$. In terms of partitions of the sample space, $T_1$ induces a finer partition than that induced by $T_2$. Thus $T_2$ is a function of $T_1$ if $T_1(x) = T_1(y)$ implies that $T_2(x) = T_2(y)$. To illustrate, consider Example 2.3.1 where we had a random sample of size three from $N(\theta, 1)$. As seen there $T_1 = (X_1, X_2 + X_3)'$ is sufficient and $T_2 = (X_1 + X_2 + X_3)$ is a function of $T_1$. Observe that $T_3 = \ldots$ is also sufficient and $T_2$ is a function of $T_3$, but neither $T_1$ is a function of $T_3$ nor $T_3$ is a function of $T_1$. Thus $T_2$ is a reduction of both $T_1$ and $T_3$. Note that $T_2$ is half the sum of all the components of $T_1$ and $T_3$.

We now define the statistic giving maximum possible reduction of the data, or minimal sufficient statistic.




Definition 2.4.1 $M(x)$ is called a minimal sufficient statistic for $\theta$ provided

(i) $M(x)$ is sufficient for $\theta$ and

(ii) If $T(x)$ is any other sufficient statistic then $T(x) = T(y)$ implies that $M(x) = M(y)$.

Thus the minimal sufficient statistic $M(x)$ is a function of any other sufficient statistic and therefore represents the maximum possible reduction of data. We now show that the likelihood equivalence leads to a minimal sufficient statistic.

THEOREM 2.4.1 Let $T$ be sufficient for $\theta$; then for any $x, y \in S_L$, $T(x) = T(y)$ implies that $x \sim y$.

Since $T$ is sufficient, by the Neyman factorizability criterion for any $x, y \in S_L$ and for every $\theta \in \Omega$ we have

$$L(x, \theta) = g_0(T(x), \theta)\, h(x)$$
$$L(y, \theta) = g_0(T(y), \theta)\, h(y)$$

But $T(x) = T(y)$ implies that $g_0(T(x), \theta) = g_0(T(y), \theta)$ and therefore $L(x, \theta) = k L(y, \theta)$ where $k = h(x)/h(y)$ does not depend on $\theta$, and thus $x \sim y$.

As seen earlier the likelihood equivalence defines a statistic $M(x)$, i.e. $M(x) = M(y)$ if and only if $x \sim y$ where $x, y \in S_L$. We now show that $M(x)$ is a sufficient statistic by proving that the conditional distribution on the sample space given $M(x)$ does not depend on $\theta$ and then use Theorem 2.3.1. We will consider the discrete case only.

THEOREM 2.4.2 Let the likelihood equivalence define a statistic $M(x)$; then $P_\theta[X = x \mid M(x) = m]$ does not depend on $\theta$ for any $\theta \in \Omega$ and every $m$ in the range of $M$.

Let $A_m = \{x \mid M(x) = m\}$; then consider

$$P_\theta[X = x \mid M(x) = m] = \frac{L(x, \theta)}{\sum_{y \in A_m} L(y, \theta)} \quad \text{if } x \in A_m$$
$$= 0 \quad \text{if } x \notin A_m$$

However for $x$ and $y \in A_m$ we have $x \sim y$ and $L(y, \theta) = k L(x, \theta)$ where $k$ can depend on $x$, $y$, say $k = k(x, y)$; then

$$P_\theta[X = x \mid M(x) = m] = \frac{1}{\sum_{y \in A_m} k(x, y)} = h(x)$$

which is independent of $\theta$.



By Theorem 2.3.1 the statistic $M(x)$ induced by likelihood equivalence is sufficient for $\theta$, and it is minimal sufficient as by Theorem 2.4.1 it is a function of any other sufficient statistic. The likelihood equivalence thus gives us a constructive method of obtaining a minimal sufficient statistic.

We have already seen that if $T_1$ is sufficient and $T_2$ is in one-one correspondence with $T_1$ then $T_2$ is also sufficient. In the same way we point out that if the family $\{f(x, \theta), \theta \in \Omega\}$ is reparametrized by considering a one-one transformation of $\theta$ to $\varphi$, then if $T_1$ is sufficient for $\theta$ it continues to remain sufficient for $\varphi$, and thus sufficiency is in fact a property of the family of distributions, which can indeed be labelled or indexed by different parametrizations. To illustrate this point consider $(X_1, \ldots, X_n)$ i.i.d. $b(1, \theta)$ and consider $\varphi = \log \frac{\theta}{1 - \theta}$, the so called log-odds in favour of occurrence of an event $E$. Now $\frac{d\varphi}{d\theta} = \frac{1}{\theta(1 - \theta)} > 0$ and $\varphi$ and $\theta$ are in one-one correspondence. Further $\theta = \frac{e^\varphi}{1 + e^\varphi}$ and as $\theta$ varies over $(0, 1)$, $\varphi$ varies over $(-\infty, \infty)$. Now we have $L(x, \theta) = \theta^{\sum x_i}(1 - \theta)^{n - \sum x_i}$ which gets transformed to

$$L_1(x, \varphi) = \frac{e^{\varphi \sum x_i}}{(1 + e^\varphi)^n}.$$

Now consider likelihood equivalence in terms of $\varphi$. We then have $x \sim y$ iff $e^{\varphi \sum x_i} = k\, e^{\varphi \sum y_i}$ for all $\varphi \in R_1$, which holds iff $\sum x_i = \sum y_i$. Looking at the Neyman Factorizability criterion, the relation $L(x, \theta) = g_0(T(x), \theta)\, h(x)$ gets transformed to $L_1(x, \varphi) = g_0(T(x), \psi(\varphi))\, h(x)$ where $\theta = \psi(\varphi)$ is the inverse transformation.

In the next section we consider some of the important classes of pdfs which occur quite often in many practical situations and which admit a minimal sufficient statistic of the same dimensionality as that of the parameter.
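The invariance of the likelihood under the log-odds reparametrization can be checked directly. This is a small sketch under assumptions made here (the function names and the sample are illustrative): it evaluates the Bernoulli likelihood once in terms of $\theta$ and once in terms of $\varphi = \log\{\theta/(1 - \theta)\}$ and confirms that the two agree.

```python
from math import exp, isclose, log

def to_log_odds(theta):
    """phi = log(theta / (1 - theta)), for 0 < theta < 1."""
    return log(theta / (1 - theta))

def from_log_odds(phi):
    """Inverse transformation theta = e^phi / (1 + e^phi)."""
    return exp(phi) / (1 + exp(phi))

def lik_theta(x, theta):
    """L(x, theta) = theta^(sum x_i) * (1 - theta)^(n - sum x_i)."""
    s = sum(x)
    return theta**s * (1 - theta)**(len(x) - s)

def lik_phi(x, phi):
    """L_1(x, phi) = exp(phi * sum x_i) / (1 + e^phi)^n."""
    return exp(phi * sum(x)) / (1 + exp(phi))**len(x)

x = (1, 0, 1, 1, 0)  # arbitrary 0/1 sample
for theta in (0.2, 0.5, 0.9):
    phi = to_log_odds(theta)
    print(theta, lik_theta(x, theta), lik_phi(x, phi),
          isclose(lik_theta(x, theta), lik_phi(x, phi), rel_tol=1e-9))
```

In both parametrizations the data enter only through $\sum x_i$, so the same statistic is sufficient whichever label is used for the family.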


2.5 Exponential Family 

First consider the case of a real parameter $\theta$. Let $X$ be a real or vector valued r.v. with pdf belonging to the class $\{f(x, \theta), \theta \in \Omega \subset R_1\}$ where the following regularity conditions hold.

$C_1$ : The support $S_\theta = \{x \mid f(x, \theta) > 0\}$ does not depend on $\theta$, i.e. $S_\theta = S$, $\forall\ \theta \in \Omega$.

$C_2$ : The parameter space $\Omega$ is an open interval of $R_1$, $\underline{\theta} < \theta < \bar{\theta}$.

$C_3$ : The pdf $f(x, \theta)$ is such that for $x \in S$ and $\theta \in \Omega$, $\log f(x, \theta) = u(\theta) k(x) + v(\theta) + w(x)$

where (a) $u(\theta)$ is twice differentiable with $\frac{du}{d\theta} \neq 0$.

(b) $\{1, k(x)\}$ are linearly independent over $S$, i.e. $a_0 + a_1 k(x) = 0$ $\forall\ x \in S$ iff $a_0 = 0$ and $a_1 = 0$.





If the above conditions hold for $\{f(x, \theta), \theta \in \Omega\}$ then we say that the r.v. $X$ or its pdf is of the exponential type, or that it belongs to a one-parameter exponential family. Many distributions used as models for practical situations belong to the exponential family and the reader should verify that the families $\{N(\theta, 1), \theta \in R_1\}$, $\{$Poisson $(\theta), \theta > 0\}$ and $\{$exponential distribution with mean $\theta$, $\theta > 0\}$ are of the exponential type. As examples of families of pdfs which are not of exponential type we first consider $U(0, \theta)$. Here $S_\theta = (0, \theta)$ and therefore $C_1$ does not hold. On the other hand consider the Laplace distribution with pdf $f(x, \theta) = \frac{1}{2} \exp\{-|x - \theta|\}$, $x \in R_1$, $\theta \in R_1$. Here $C_1$ and $C_2$ hold but $\log f(x, \theta) = -\log 2 - |x - \theta|$ cannot be written in the form $u(\theta) k(x) + v(\theta) + w(x)$. The Cauchy distribution with pdf $f(x, \theta) = \frac{1}{\pi}\, \frac{1}{1 + (x - \theta)^2}$, $x \in R_1$, $\theta \in R_1$ is also not of the exponential type since, as in the case of the Laplace distribution, $C_1$ and $C_2$ hold but $C_3$ does not hold.


Example 2.5.1 Consider the Pareto distribution with pdf $f(x, \theta) = \frac{\theta}{x^{\theta + 1}}$, $x > 1$, $\theta > 0$. Here $S_\theta = (1, \infty)$ and $\Omega = (0, \infty)$ and $C_1$ and $C_2$ are satisfied. Consider $\log f(x, \theta) = \log \theta - (\theta + 1) \log x$; then $u(\theta) = -(\theta + 1)$, $k(x) = \log x$, $\frac{du}{d\theta} = -1$ and $\frac{d^2 u}{d\theta^2} = 0$, so $C_3(a)$ holds. Consider $a_0 + a_1 \log x = \varphi(x)$ for $x > 1$. If $\varphi(x) = 0$ for all $x > 1$, then taking $x = e$ and $e^2$ we have $a_0 + a_1 = 0$ and $a_0 + 2a_1 = 0$ which implies that $a_0 = 0$ and $a_1 = 0$, and $C_3(b)$ also holds. One can show the linear independence of $\{1, k(x)\}$ over $S$ in a variety of ways. For example if $\varphi(x) = 0$ for $x > 1$ then $\varphi'(x) = 0$, $\forall\ x > 1$, and as $\varphi'(x) = \frac{a_1}{x} = 0$, $\forall\ x > 1$, we have $a_1 = 0$ and therefore $a_0 = 0$. Thus $f(x, \theta) = \frac{\theta}{x^{\theta + 1}}$, $x > 1$, $\theta > 0$ belongs to an exponential family.


Example 2.5.2 Let $X$ be a geometric r.v. with pmf given by $f(x, \theta) = \theta(1 - \theta)^x$, $x = 0, 1, 2, \ldots$, $0 < \theta < 1$. Then $S = \{0, 1, 2, \ldots\}$ which does not depend on $\theta$ and $\Omega = (0, 1)$ is an open interval of $R_1$. Further $\log f(x, \theta) = \log \theta + x \log (1 - \theta)$ so that $u(\theta) = \log (1 - \theta)$ and $k(x) = x$. Then $\frac{du}{d\theta} = -\frac{1}{(1 - \theta)}$ and $\frac{d^2 u}{d\theta^2} = -\frac{1}{(1 - \theta)^2}$ and $C_3(a)$ holds. Further $a_0 + a_1 x = 0$ for $x = 0, 1, 2, \ldots$ implies that $a_0 = 0$, $a_1 = 0$, as we can take $x = 0$ to conclude $a_0 = 0$ and then $x = 1$ to claim $a_1 = 0$.

Example 2.5.3 There are situations in biological assays when the probability of a favourable response at dosage level $d > 0$ is given by $\psi(\beta) = \frac{e^{d\beta}}{1 + e^{d\beta}}$, so that $P[X = 1] = \psi(\beta)$ and $P[X = 0] = 1 - \psi(\beta)$. Here $\beta \in R_1$ and $S = \{0, 1\}$ and $C_1$, $C_2$ are satisfied. Further

$$\log f(x, \beta) = x \log \psi(\beta) + (1 - x) \log [1 - \psi(\beta)]$$
$$= x \log \frac{\psi(\beta)}{1 - \psi(\beta)} + \log [1 - \psi(\beta)]$$
$$= x\, d\beta - \log (1 + e^{d\beta})$$

Thus $k(x) = x$ and $u(\beta) = d\beta$. Again $\frac{du}{d\beta} = d \neq 0$ and $\frac{d^2 u}{d\beta^2} = 0$, and $a_0 + a_1 x = 0$ for $x = 0$ and $1$ implies that $a_0 = 0$ and $a_1 = 0$. Therefore $\{f(x, \beta), \beta \in R_1\}$ is an exponential family.
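The algebra in this example can be confirmed numerically. The sketch below assumes the logistic response probability $\psi(\beta) = e^{d\beta}/(1 + e^{d\beta})$, which is the form consistent with the final expression $x\,d\beta - \log(1 + e^{d\beta})$; the function names and the test values of $\beta$ and $d$ are illustrative choices.

```python
from math import exp, isclose, log

def response_prob(beta, d):
    """psi(beta) = exp(d*beta) / (1 + exp(d*beta)) at dosage level d > 0 (assumed form)."""
    return exp(d * beta) / (1 + exp(d * beta))

def log_pmf_direct(x, beta, d):
    """log f(x, beta) = x log psi + (1 - x) log(1 - psi), for x in {0, 1}."""
    p = response_prob(beta, d)
    return x * log(p) + (1 - x) * log(1 - p)

def log_pmf_exp_family(x, beta, d):
    """Exponential-family form u(beta) k(x) + v(beta) with u = d*beta, k(x) = x."""
    return x * d * beta - log(1 + exp(d * beta))

for beta in (-1.3, 0.0, 0.4):
    for x in (0, 1):
        print(beta, x, log_pmf_direct(x, beta, 2.0), log_pmf_exp_family(x, beta, 2.0))
```

The two columns agree, so $u(\beta) = d\beta$, $k(x) = x$, $v(\beta) = -\log(1 + e^{d\beta})$ and $w(x) = 0$ give a valid representation.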

Let $(X_1, \ldots, X_n)$ be i.i.d. with pdf belonging to $\{f(x, \theta), \theta \in \Omega\}$ which forms an exponential family. Then the joint pdf of $(X_1, \ldots, X_n)$ is given by $L(x, \theta) = \prod_{i=1}^{n} f(x_i, \theta)$. Consider the family $\{L(x, \theta), \theta \in \Omega\}$; then $S_L = S^n$ and the parameter space for $L$ is $\Omega$, the same as that for $f$. Further for $x \in S^n$ and $\theta \in \Omega$ we have

$$\log L(x, \theta) = u(\theta) \sum k(x_i) + n v(\theta) + \sum w(x_i)$$

Therefore defining $T(x) = \sum_{i=1}^{n} k(x_i)$ we have that $\{L(x, \theta), \theta \in \Omega\}$ is an exponential family provided $a_0 + a_1 T(x) = 0$ for $x \in S^n$ implies that $a_0 = 0$ and $a_1 = 0$. Now consider $x_1 \in S$, $(x_2^0, \ldots, x_n^0) \in S^{n-1}$ where $(x_2^0, \ldots, x_n^0)$ is a fixed point. Then $a_0 + a_1 T(x) = a_0 + a_1 \sum_{i=2}^{n} k(x_i^0) + a_1 k(x_1) = b_0 + a_1 k(x_1)$. But as $\{f(x_1, \theta), \theta \in \Omega\}$ is an exponential family, $b_0 + a_1 k(x_1) = 0$ for all $x_1 \in S$ implies that $b_0 = 0$ and $a_1 = 0$. This in turn implies $a_1 = 0$ and $a_0 = 0$. Therefore $\{L(x, \theta), \theta \in \Omega\}$ is also an exponential family.

We now show that $T = \sum k(X_i)$ is a minimal sufficient statistic. Let $x, y \in S_L$; then if $x \sim y$, $\log L(x, \theta) - \log L(y, \theta) = u(\theta)[\sum k(x_i) - \sum k(y_i)] + [\sum w(x_i) - \sum w(y_i)]$ should not depend on $\theta$. Therefore taking the derivative w.r.t. $\theta$ and observing that $\frac{du}{d\theta} \neq 0$ we have $\sum k(x_i) - \sum k(y_i) = 0$. Therefore $x \sim y$ iff $\sum k(x_i) = \sum k(y_i)$, and as the likelihood equivalence leads to a minimal sufficient statistic, $T = \sum k(X_i)$ is a minimal sufficient statistic.

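We further show that the pdf of $T$ is itself of the exponential type. Consider a transformation $t_1 = \sum_{i=1}^{n} k(x_i)$ and $t_2 = x_2, \ldots, t_n = x_n$. Then the joint pdf of $(t_1, \ldots, t_n)$ is given by

$$g(t_1, \ldots, t_n, \theta) = L(x, \theta) \left|\frac{\partial(x_1, \ldots, x_n)}{\partial(t_1, \ldots, t_n)}\right|$$
$$= \exp\{u(\theta) t_1 + n v(\theta)\} \exp\left\{\sum w(h_i(t))\right\} \left|\frac{\partial(x_1, \ldots, x_n)}{\partial(t_1, \ldots, t_n)}\right|$$

where $x_i = h_i(t)$, $i = 1, 2, \ldots, n$ is the inverse transformation. Thus $g(t, \theta) = \exp\{u(\theta) t_1 + n v(\theta)\}\, p(t_1, \ldots, t_n)$. Therefore

$$g_0(t_1, \theta) = \exp\{u(\theta) t_1 + n v(\theta)\} \int p(t_1, \ldots, t_n)\, dt_2 \ldots dt_n$$
$$= \exp\{u(\theta) t_1 + n v(\theta) + v_1(t_1)\}$$

where $v_1(t_1) = \log\left[\int p(t_1, \ldots, t_n)\, dt_2 \ldots dt_n\right]$. Note that $\exp\{\sum w(h_i(t))\} > 0$ and $\left|\frac{\partial(x_1, \ldots, x_n)}{\partial(t_1, \ldots, t_n)}\right| > 0$, since we take the absolute value of the Jacobian expressed in terms of $(t_1, \ldots, t_n)$; therefore $p(t_1, \ldots, t_n) > 0$ and its integral is positive. It then follows that $\{g_0(t_1, \theta), \theta \in \Omega\}$ is an exponential family.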

These results can be generalized for the case of a multiparameter exponential family which we define next. The condition C-1 remains the same. Since $\theta = (\theta_1, \ldots, \theta_m)'$ we now change C-2 as

C$'$-2 : $\theta \in \Omega$ where $\Omega$ is an open set of $R_m$ containing an $m$-dimensional rectangle $\underline{\theta}_i < \theta_i < \bar{\theta}_i$, $i = 1, 2, \ldots, m$.

Similarly C-3 is changed to

C$'$-3 : $\log f(x, \theta) = \sum_{r=1}^{m} u_r(\theta) k_r(x) + v(\theta) + w(x)$

where

(a) $u_r(\theta)$, $r = 1, 2, \ldots, m$ has partial derivatives of order two and the Jacobian $\frac{\partial(u_1, \ldots, u_m)}{\partial(\theta_1, \ldots, \theta_m)} \neq 0$, $\forall\ \theta \in \Omega$.

(b) $\{1, k_1(x), \ldots, k_m(x)\}$ are linearly independent over $S$.

If the above conditions hold then we say that $\{f(x, \theta), \theta \in \Omega\}$ is an $m$-parameter exponential family. The reader can verify that the classical normal distribution $N(\theta_1, \theta_2)$ or the two parameter Gamma distribution with pdf $f(x, \theta, \lambda) = \frac{\theta^\lambda}{\Gamma(\lambda)} e^{-\theta x} x^{\lambda - 1}$, $x > 0$, $\theta > 0$, $\lambda > 0$ form two parameter exponential families.

We now show that if $(X_1, \ldots, X_n)'$ are i.i.d. with the pdf belonging to an $m$-parameter exponential family then $(T_1, \ldots, T_m)'$ where $T_r(x) = \sum_{i=1}^{n} k_r(x_i)$ is a minimal sufficient statistic. Consider $x, y \in S_L$; then $x \sim y$ iff

$$\log L(x, \theta) - \log L(y, \theta) = \sum_{r=1}^{m} u_r(\theta)\,[T_r(x) - T_r(y)] + \sum_{i=1}^{n} (w(x_i) - w(y_i))$$

is independent of $\theta$; this holds iff the derivatives of the RHS w.r.t. $\theta_i$, $i = 1, 2, \ldots, m$ are all zero. Thus we must have

$$\sum_{r=1}^{m} \frac{\partial u_r(\theta)}{\partial \theta_i}\,[T_r(x) - T_r(y)] = 0, \quad i = 1, 2, \ldots, m.$$

However, as $\left|\frac{\partial(u_1, \ldots, u_m)}{\partial(\theta_1, \ldots, \theta_m)}\right| \neq 0$ we must have $T_r(x) = T_r(y)$ for $r = 1, 2, \ldots, m$. Thus $\sim$ defines an $m$-dimensional minimal sufficient statistic $\left(\sum k_1(x_i), \ldots, \sum k_m(x_i)\right)' = (T_1, \ldots, T_m)'$.

As in the case of $m = 1$, we also have the property that $\{g_0(t_1, \ldots, t_m, \theta), \theta \in \Omega\}$ is itself an $m$-parameter exponential family with pdf

$$g_0(t_1, \ldots, t_m, \theta) = \exp\left\{\sum_{r=1}^{m} u_r(\theta) t_r + n v(\theta) + v_1(t_1, \ldots, t_m)\right\}$$

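Example 2.5.4 Consider the classical regression problem in which we have the response $y_i = \alpha + \beta x_i + \varepsilon_i$, $i = 1, 2, \ldots, n$, where the $x_i$'s are fixed levels and the $\varepsilon_i$'s represent random errors and are i.i.d. $N(0, \sigma^2)$. Then the joint pdf of $(Y_1, \ldots, Y_n)$ is given by

$$L(y, \alpha, \beta, \sigma^2) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n \exp\left\{-\frac{\sum (y_i - \alpha - \beta x_i)^2}{2\sigma^2}\right\}$$

Here $S_L = R_n$, $\Omega = R_1 \times R_1 \times R^+$ and $C_1$ and $C_2'$ hold. Now

$$\log L(y, \alpha, \beta, \sigma^2) = -\frac{n}{2} \log 2\pi\sigma^2 - \frac{\sum (\alpha + \beta x_i)^2}{2\sigma^2} - \frac{\sum y_i^2}{2\sigma^2} + \frac{\sum y_i(\alpha + \beta x_i)}{\sigma^2}$$

Therefore we can take $v(\alpha, \beta, \sigma^2) = -\frac{n}{2} \log 2\pi\sigma^2 - \sum \frac{(\alpha + \beta x_i)^2}{2\sigma^2}$, as $(x_1, \ldots, x_n)$ are fixed known constants. Further $w(y) = 0$ and

$$u_1(\alpha, \beta, \sigma^2) = -\frac{1}{2\sigma^2}, \quad T_1(y) = \sum y_i^2$$
$$u_2(\alpha, \beta, \sigma^2) = \frac{\alpha}{\sigma^2}, \quad T_2(y) = \sum y_i$$
$$u_3(\alpha, \beta, \sigma^2) = \frac{\beta}{\sigma^2}, \quad T_3(y) = \sum x_i y_i$$

Now

$$\frac{\partial(u_1, u_2, u_3)}{\partial(\alpha, \beta, \sigma^2)} = \begin{vmatrix} 0 & 0 & 1/2\sigma^4 \\ 1/\sigma^2 & 0 & -\alpha/\sigma^4 \\ 0 & 1/\sigma^2 & -\beta/\sigma^4 \end{vmatrix} = \frac{1}{2\sigma^8} > 0$$

Consider $a_0 \sum y_i^2 + a_1 \sum y_i + a_2 \sum x_i y_i + a_3 = 0$, $\forall\ y \in R_n$. Take $y = 0$; then we have $a_3 = 0$. Take $(y, 0, \ldots, 0)'$; then we have $a_0 y^2 + y(a_1 + a_2 x_1) = 0$, $\forall\ y \in R_1$, which implies that $a_0 = 0$ and $a_1 + a_2 x_1 = 0$. Next taking $(0, y, \ldots, 0)'$ we have $a_1 + a_2 x_2 = 0$. If $x_1 \neq x_2$ this immediately implies that $a_1 = 0$, $a_2 = 0$. Thus under the additional condition (often implicitly assumed) that there is at least one pair $(x_i, x_j)$ where $x_i \neq x_j$ we have that $\{L(y, \alpha, \beta, \sigma^2), (\alpha, \beta) \in R_2, \sigma^2 \in R^+\}$ is a three parameter exponential family with $(\sum y_i^2, \sum y_i, \sum x_i y_i)'$ as minimal sufficient statistic for $(\alpha, \beta, \sigma^2)'$. The classical Least Square estimators

$$\hat{\beta} = \left(\sum x_i y_i - \bar{x} \sum y_i\right) \Big/ \sum (x_i - \bar{x})^2$$

and $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$ and $\hat{\sigma}^2 = \frac{1}{n}\left(\sum y_i^2 - \hat{\alpha} \sum y_i - \hat{\beta} \sum x_i y_i\right)$

are functions of the minimal sufficient statistic. Observe that the condition $x_i \neq x_j$ for at least one pair $(i, j)$ is equivalent to $\sum (x_i - \bar{x})^2 > 0$. Note that if all $x_i$'s are equal to $x$ then $(Y_1, \ldots, Y_n)$ are i.i.d. $N(\alpha + \beta x, \sigma^2)$ and $(\alpha, \beta, \sigma^2)$ is no longer a labelling parameter, as we can determine $(\alpha_1, \beta_1)$ and $(\alpha_2, \beta_2)$ such that $\alpha_1 + \beta_1 x = \alpha_2 + \beta_2 x$ but $\alpha_1 \neq \alpha_2$ and $\beta_1 \neq \beta_2$.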
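To make the statement that the least squares estimators are functions of the minimal sufficient statistic concrete, the sketch below computes them from $(\sum y_i^2, \sum y_i, \sum x_i y_i)$ and the fixed design levels alone, without access to the individual responses. The function name is hypothetical, and the divisor $n$ in the variance estimate is an assumption made for this illustration; some treatments divide by $n - 2$ instead.

```python
def least_squares_from_sufficient(t1, t2, t3, x):
    """Estimates of (alpha, beta, sigma^2) from the sufficient statistic.

    t1 = sum(y_i^2), t2 = sum(y_i), t3 = sum(x_i * y_i); x holds the fixed levels.
    The variance estimate uses divisor n (an assumption of this sketch).
    """
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar)**2 for xi in x)
    if sxx == 0:
        # all x_i equal: (alpha, beta) cannot be separated
        raise ValueError("need at least one pair with x_i != x_j")
    beta = (t3 - xbar * t2) / sxx
    alpha = t2 / n - beta * xbar
    sigma2 = (t1 - alpha * t2 - beta * t3) / n
    return alpha, beta, sigma2

# Responses lying exactly on the line y = 1 + 2x, so the residual variance is 0.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
t1 = sum(v * v for v in y)
t2 = sum(y)
t3 = sum(a * b for a, b in zip(x, y))
print(least_squares_from_sufficient(t1, t2, t3, x))
```

The guard on `sxx` mirrors the condition $\sum (x_i - \bar{x})^2 > 0$ discussed above.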

Example 2.5.5 Consider a bivariate discrete r.v. with pmf

$$f(x, y, \lambda, p) = \binom{x}{y} p^y (1-p)^{x-y}\, \frac{e^{-\lambda}\lambda^x}{x!},$$

$y = 0, 1, 2, \ldots, x$, $x = 0, 1, 2, \ldots$, $\lambda > 0$, $0 < p < 1$.

This model arises when conditionally $(Y \mid X = x) \sim$ Binomial $(x, p)$ and marginally $X \sim$ Poisson $(\lambda)$, which is used in a variety of situations. For example let $X$ represent the number of telephone calls received in an exchange and $Y$ denote the number of calls which are wrongly connected. Here $S$, the range of the pmf, does not depend on $\theta = (\lambda, p)$, $\Omega = (0, \infty) \times (0, 1)$, and C-1 and C-2 both hold. Further

$$\log f = \log\binom{x}{y} - \log x! - \lambda + x[\log\lambda + \log(1-p)] + y[\log p - \log(1-p)]$$

and we have

$$u_1(\lambda, p) = \log\lambda + \log(1-p), \qquad k_1(x, y) = x,$$
$$u_2(\lambda, p) = \log p - \log(1-p), \qquad k_2(x, y) = y.$$

Now,

$$\frac{\partial(u_1, u_2)}{\partial(\lambda, p)} = \begin{vmatrix} 1/\lambda & -1/(1-p) \\ 0 & 1/p(1-p) \end{vmatrix} = \frac{1}{\lambda p(1-p)} > 0.$$




Next consider $a_0 + a_1 x + a_2 y = 0$ for all $(x, y) \in S$. Take $(x, y) = (0, 0)$ to obtain $a_0 = 0$, and then $(x, y) = (1, 0)$ to obtain $a_1 = 0$, which then implies $a_2 = 0$. Therefore the joint pmf belongs to a two parameter exponential family and $(\sum X_i, \sum Y_i)'$ is minimal sufficient statistic for $(\lambda, p)'$.


Exercise 2.5.1 (1) Let $f(x, \sigma) = \frac{1}{2\sigma} \exp(-|x|/\sigma)$, $x \in R_1$, $\sigma > 0$. Show that $\{f(x, \sigma), \sigma > 0\}$ is a one-parameter exponential family and $\sum |X_i| \sim G(n, 1/\sigma)$ is minimal sufficient statistic for $\sigma$.

(2) Let $f(x, \theta) = \theta x^{\theta - 1}$, $0 < x < 1$ and $\theta > 0$. Show that this pdf belongs to one parameter exponential family. Show that $\sum \log X_i$ is minimal sufficient and obtain its distribution.

(3) Let $(X_1, \ldots, X_n)$ be a random sample of size $n$ from $N(\mu_1, \sigma_1^2)$ and $(Y_1, Y_2, \ldots, Y_m)$ be a random sample of size $m$ from $N(\mu_2, \sigma_2^2)$. Show that the joint pdf of $(X, Y)$ belongs to four parameter exponential family and $(\bar{X}, \bar{Y}, S_1^2, S_2^2)$ is minimal sufficient for $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$ where $S_1^2 = \sum (X_i - \bar{X})^2$, $S_2^2 = \sum (Y_i - \bar{Y})^2$.

(4) If in (3) above $\sigma_1^2 = \sigma_2^2 = \sigma^2$ show that we have a three parameter exponential family with $(\bar{X}, \bar{Y}, S_1^2 + S_2^2)$ as minimal sufficient statistic.

(5) If in (3) above $\mu_1 = \mu_2 = \mu$ then the joint pdf of $(X, Y)'$ is not an exponential family but $(\bar{X}, \bar{Y}, S_1^2, S_2^2)$ is minimal sufficient for $(\mu, \sigma_1^2, \sigma_2^2)$.

(6) Consider a power series distribution with pmf given by

$$f(x, \theta) = \frac{a(x)\theta^x}{b(\theta)},$$

$x = 0, 1, 2, \ldots$, $0 < \theta < \rho$, the radius of convergence, $a(x) \geq 0$ and $b(\theta) = \sum_{x=0}^{\infty} a(x)\theta^x$. Show that $\{f(x, \theta), 0 < \theta < \rho\}$ is an exponential family. Many distributions such as Binomial $(n, \theta)$, Poisson $(\theta)$, negative binomial with $f(x, \theta) = \binom{x + k - 1}{k - 1} \theta^x (1 - \theta)^k$ with $k$ known, and logarithmic series distribution with $f(x, \theta) = \frac{-\theta^x}{x \log(1 - \theta)}$, $x = 1, 2, \ldots$, $0 < \theta < 1$, are examples of power series distributions.

(7) Let $\{f(x, \theta), \theta \in \Omega\}$ be an m-parameter exponential family and let $A$ be a subset of $S$ where $A$ does not depend on $\theta$. Consider the distribution truncated to set $A$ with pdf given by $f_A(x, \theta) = \frac{f(x, \theta)}{P_\theta(A)}$, $x \in A$. Show that $\{f_A(x, \theta), \theta \in \Omega\}$ is again an m-parameter exponential family.


2.6 Pitman Family 

This section considers the case where $(X_1, \ldots, X_n)$ is a random sample of size $n$ from $\{f(x, \theta), \theta \in \Omega \subset R_1\}$ where the range $S_\theta = \{x \mid f(x, \theta) > 0\}$ depends on $\theta$. The examples of such a family are $U(0, \theta)$, the Uniform distribution over $(0, \theta)$, and the exponential distribution with location $\theta$ with $f(x, \theta) = \exp\{-(x - \theta)\}$, $x > \theta$. We have seen earlier in Example 2.4.2 that for $\{U(0, \theta), \theta > 0\}$, $X_{(n)}$ is minimal sufficient statistic for $\theta$ and the reader can verify that for the exponential distribution with location $\theta$, $X_{(1)}$ is minimal sufficient for $\theta$. The pdf of $U(0, \theta)$ has the form $f(x, \theta) = \frac{1}{\theta}$, $0 < x < \theta$, and here the upper limit depends on $\theta$, whereas for the exponential distribution with location $\theta$, $f(x, \theta) = e^{-x}/e^{-\theta}$, $\theta < x < \infty$, and the lower limit depends on $\theta$. In both cases the pdf has the form $\frac{u(x)}{v(\theta)}$ over $S_\theta$. We generalize this situation to define Pitman family.


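Definition 2.6.1 Let $\{f(x, \theta), \theta \in \Omega \subset R_1\}$ be a family of pdfs such that $f(x, \theta) = \frac{u(x)}{v(\theta)}$ for $a(\theta) < x < b(\theta)$ where $a(\theta) < b(\theta)$ for every $\theta \in \Omega$ and $\Omega$ is an interval $(\underline{\theta}, \bar{\theta})$; then $\{f(x, \theta), \theta \in \Omega \subset R_1\}$ is a Pitman family.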

(A) We now show that for $a(\theta) = a$, a constant, $X_{(n)}$ is minimal sufficient for $\theta$ and for the case $b(\theta) = b$, $X_{(1)}$ is minimal sufficient for $\theta$.

Now

$$L(x, \theta) = \frac{1}{[v(\theta)]^n} \prod_{i=1}^{n} u(x_i), \quad a < x_i < b(\theta),\ i = 1, 2, \ldots, n$$

$$= \frac{1}{[v(\theta)]^n} \prod_{i=1}^{n} [u(x_i)\, \psi(x_i, a)\, \psi(b(\theta), x_i)], \quad x \in R_n,\ \theta \in \Omega$$

where $\psi(c, d) = 1$ if $c > d$ and zero otherwise. Noting that $\prod_{i=1}^{n} \psi(b(\theta), x_i) = \psi(b(\theta), x_{(n)})$, we have

$$L(x, \theta) = \frac{1}{[v(\theta)]^n} \prod_{i=1}^{n} [u(x_i)\, \psi(x_i, a)] \times \psi(b(\theta), x_{(n)}).$$

Let $x \in R_n$, $y \in R_n$; then $x \sim y$ iff

$$\psi(b(\theta), x_{(n)}) = \psi(b(\theta), y_{(n)}) \qquad (2.6.1)$$

Now if $x_{(n)} \neq y_{(n)}$ then we can determine a value of $\theta$ such that $b(\theta)$ is between $x_{(n)}$ and $y_{(n)}$, and (2.6.1) cannot hold. Hence $x \sim y$ iff $x_{(n)} = y_{(n)}$, or $\sim$ determines the statistic $T = X_{(n)}$. As already seen in Example 2.4.2, $X_{(n)}$ is minimal sufficient for $\theta$.

(B) In a similar way if $f(x, \theta) = \frac{u(x)}{v(\theta)}$, $a(\theta) < x < b$, then

$$L(x, \theta) = \frac{1}{[v(\theta)]^n} \prod [u(x_i)\, \psi(b, x_i)\, \psi(x_i, a(\theta))], \quad x \in R_n,\ \theta \in \Omega$$

$$= \frac{1}{[v(\theta)]^n} \prod [u(x_i)\, \psi(b, x_i)] \times \psi(x_{(1)}, a(\theta))$$

since $\prod \psi(x_i, a(\theta)) = \psi(x_{(1)}, a(\theta))$. Therefore

$$x \sim y \ \text{ iff } \ \psi(x_{(1)}, a(\theta)) = \psi(y_{(1)}, a(\theta)) \qquad (2.6.2)$$

Again if $x_{(1)} \neq y_{(1)}$ we can determine a value of $\theta$ such that $a(\theta)$ is between $x_{(1)}$ and $y_{(1)}$ and (2.6.2) does not hold. Thus $\sim$ defines the minimal sufficient statistic $T = X_{(1)}$.

(C) When $a(\theta)$ and $b(\theta)$ both depend on $\theta$, then if $a(\theta) \downarrow$ and $b(\theta) \uparrow$, $T(x) = \max\{a^{-1}(x_{(1)}), b^{-1}(x_{(n)})\}$ is minimal sufficient for $\theta$. Here we have

$$L(x, \theta) = \frac{1}{[v(\theta)]^n} \prod [u(x_i)\, \psi(x_i, a(\theta))\, \psi(b(\theta), x_i)]$$

$$= \frac{1}{[v(\theta)]^n} \prod u(x_i) \times \psi(x_{(1)}, a(\theta))\, \psi(b(\theta), x_{(n)}).$$

Now observe that $a(\theta) < x_{(1)}$ iff $a^{-1}(x_{(1)}) < \theta$ since $a(\theta) \downarrow$, and $x_{(n)} < b(\theta)$ iff $b^{-1}(x_{(n)}) < \theta$ as $b(\theta) \uparrow$.

Therefore, $a(\theta) < x_{(1)} \leq x_{(n)} < b(\theta)$ iff $\max\{a^{-1}(x_{(1)}), b^{-1}(x_{(n)})\} = T(x) < \theta$, and $\psi(x_{(1)}, a(\theta))\, \psi(b(\theta), x_{(n)}) = \psi(\theta, T(x))$. It then follows that $x \sim y$ iff $\psi(\theta, T(x)) = \psi(\theta, T(y))$. Again arguing as before we conclude that $T(x) = T(y)$ and $\sim$ defines the statistic $T = \max\{a^{-1}(X_{(1)}), b^{-1}(X_{(n)})\}$.

(D) When $a(\theta) \uparrow$ and $b(\theta) \downarrow$ then in a similar way we can show that $\sim$ defines the statistic $T = \min\{a^{-1}(X_{(1)}), b^{-1}(X_{(n)})\}$.

Example 2.6.1 Consider $f(x, \theta) = \frac{1}{2\theta}$, $-\theta < x < \theta$. Here $S_\theta = (-\theta, \theta)$, $\Omega = (0, \infty)$ and $a(\theta) = -\theta \downarrow$ and $b(\theta) = \theta \uparrow$. Then $a^{-1}(x_{(1)}) = -x_{(1)}$ and $b^{-1}(x_{(n)}) = x_{(n)}$ and $T = \max(-X_{(1)}, X_{(n)})$ is minimal sufficient for $\theta$. To obtain $g_0(t, \theta)$ consider the d.f. of $T$. Note that the range of $T$ is $(0, \theta)$ and thus $G(t, \theta) = P[T \leq t] = 0$ if $t < 0$ and $= 1$ if $t \geq \theta$. For $0 < t < \theta$ we have

$$G(t, \theta) = \int_{-t}^{t} \int_{-t}^{x_{(n)}} \frac{n(n-1)}{(2\theta)^n}\, (x_{(n)} - x_{(1)})^{n-2}\, dx_{(1)}\, dx_{(n)}$$

since $\max(-x_{(1)}, x_{(n)}) \leq t$ iff $-t \leq x_{(1)} \leq x_{(n)} \leq t$ and the joint pdf of $(X_{(1)}, X_{(n)})$ is the integrand above. Therefore

$$G(t, \theta) = \begin{cases} 0 & \text{if } t < 0 \\ (t/\theta)^n & \text{if } 0 \leq t < \theta \\ 1 & \text{if } t \geq \theta \end{cases}$$

Observe that $G(t, \theta)$ is the d.f. of $Y_{(n)}$ where $(Y_1, \ldots, Y_n)$ are i.i.d. $U(0, \theta)$. Now we have $\max[-x_{(1)}, x_{(n)}] = \max(|x_1|, \ldots, |x_n|)$. Let $Y = |X|$; then as $X$ varies over $(-\theta, \theta)$, $Y$ varies over $(0, \theta)$ and $P[Y \leq y] = P[-y \leq X \leq y] = \int_{-y}^{y} \frac{dx}{2\theta} = \frac{y}{\theta}$ for $0 < y < \theta$, and $(Y_1, \ldots, Y_n)$ are i.i.d. $U(0, \theta)$. The sufficient statistic for $\theta$ based on $Y$ is

$$\max(Y_1, \ldots, Y_n) = \max(|X_1|, \ldots, |X_n|) = \max(-X_{(1)}, X_{(n)}).$$
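The distribution function $(t/\theta)^n$ of $T = \max(-X_{(1)}, X_{(n)})$ can be compared with a Monte Carlo estimate. The sketch below is illustrative only: the parameter values, seed and number of replications are arbitrary assumptions, and the helper names are not from the text.

```python
import random

def t_statistic(sample):
    """T = max(-X_(1), X_(n)), which equals the largest absolute value."""
    return max(-min(sample), max(sample))

def empirical_cdf_of_t(theta, n, t, reps, rng):
    """Fraction of simulated U(-theta, theta) samples of size n with T <= t."""
    hits = 0
    for _ in range(reps):
        sample = [rng.uniform(-theta, theta) for _ in range(n)]
        if t_statistic(sample) <= t:
            hits += 1
    return hits / reps

rng = random.Random(12345)             # arbitrary seed
theta, n, t = 2.0, 4, 1.5
estimate = empirical_cdf_of_t(theta, n, t, 20000, rng)
exact = (t / theta) ** n               # G(t, theta) for 0 <= t < theta
print(estimate, exact)
```

With 20000 replications the estimate should be within a few hundredths of the exact value; a closer agreement needs more replications.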

When $a(\theta)$ and $b(\theta)$ are not of the above four types then we show by an example that there is no one dimensional sufficient statistic for $\theta$ although $(X_{(1)}, X_{(n)})$ is minimal sufficient.

Example 2.6.2 Let $X_1, X_2$ be i.i.d. $U(\theta, \theta + 1)$, $\theta \in R_1$; then $a(\theta) \uparrow$ and $b(\theta) \uparrow$ and $L(x, \theta) = 1$, $\theta < x_i < \theta + 1$, $i = 1, 2$. If $(X_{(1)}, X_{(2)})$ is the order statistic, which is sufficient for $\theta$ in view of the general result for samples on continuous r.v., then

$$g_0(x_{(1)}, x_{(2)}, \theta) = 2!, \quad \theta < x_{(1)} < x_{(2)} < \theta + 1$$
$$= 2!\, \psi(x_{(1)}, \theta)\, \psi(\theta, x_{(2)} - 1).$$

Now

$(x_{(1)}, x_{(2)}) \sim (y_{(1)}, y_{(2)})$ iff $\psi(x_{(1)}, \theta)\, \psi(\theta, x_{(2)} - 1) = \psi(y_{(1)}, \theta)\, \psi(\theta, y_{(2)} - 1)$.

Now if either $x_{(1)} \neq y_{(1)}$ or $x_{(2)} \neq y_{(2)}$ we can find a value of $\theta$ in $R_1$ such that the above equality fails and therefore $(X_{(1)}, X_{(2)})$ is minimal sufficient.

It is an interesting exercise to obtain the distributions of the minimal sufficient statistic in the four cases discussed above and the reader is strongly recommended to derive these distributions using the distribution theory of order statistics. For further details we refer to Huzurbazar (1955).


Minimum Variance Unbiased Estimation

3.1 Unbiasedness

Consider a random sample of size $n$ from $\{f(x, \theta), \theta \in \Omega \subset R_1\}$ and suppose that $T$ is an estimator of the function $\psi(\theta)$ which is of interest. Let $\{g(t, \theta), \theta \in \Omega\}$ be the corresponding class of pdfs of $T$. Then we evaluate the performance of $T$ as an estimator of $\psi(\theta)$ using the sampling distribution of $T$. The difference between $T(x)$ and $\psi(\theta)$ given by $e(x, \theta) = T(x) - \psi(\theta)$ for each fixed $\theta$ and given $x$ is the residual or error. We regard errors of overestimation and underestimation on par and thus a natural quantity to look at is $|T(x) - \psi(\theta)|$ following Boscovitch and Laplace or $(T(x) - \psi(\theta))^2$ following Gauss and Legendre. Since $(T(X) - \psi(\theta))^2$ is a r.v. for each fixed $\theta$, we consider $E_\theta[(T(X) - \psi(\theta))^2] = \text{MSE}(T \mid \theta)$ and prefer $T_1$ over $T_2$ if $\text{MSE}(T_1 \mid \theta) \leq \text{MSE}(T_2 \mid \theta)$ for all $\theta \in \Omega$. Recall that by Tchebychev's inequality for any estimator $T$, for any $\varepsilon > 0$

$$P_\theta[\,|T(X) - \psi(\theta)| < \varepsilon\,] \geq 1 - \frac{\text{MSE}(T \mid \theta)}{\varepsilon^2}$$

and $\text{MSE}(T \mid \theta)$ in a way indicates concentration of the distribution of $T$ around $\psi(\theta)$, and we prefer an estimator with greater concentration around $\psi(\theta)$, which corresponds to smaller mean squared error. However just working with the MSE criterion does not lead to a unique choice or recommendation of an estimator $T$, as the following example shows.


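Example 3.1.1 Consider $(X_1, \ldots, X_n)$ i.i.d. $N(\theta, 1)$, $T_1(x) = \bar{x}$ and $T_2(x) = \theta_0$. Then as $\bar{X} \sim N(\theta, 1/n)$ we have $\text{MSE}(T_1 \mid \theta) = 1/n$ and $\text{MSE}(T_2 \mid \theta) = (\theta - \theta_0)^2$. Now for values of $\theta \in \big(\theta_0 - \frac{1}{\sqrt{n}}, \theta_0 + \frac{1}{\sqrt{n}}\big)$, $\text{MSE}(T_2 \mid \theta) < \text{MSE}(T_1 \mid \theta)$ and for other values of $\theta$ we have $\text{MSE}(T_2 \mid \theta) \geq \text{MSE}(T_1 \mid \theta)$, and we cannot choose $T_1$ over $T_2$ or $T_2$ over $T_1$. Observe that $T_1$ is minimal sufficient whereas $T_2$ does not depend on the observations at all, and ignores the data completely.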
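The crossing of the two MSE curves is easy to tabulate. The following sketch simply evaluates the two closed-form expressions from the example; the values $n = 25$ and $\theta_0 = 1$ are arbitrary illustrative choices.

```python
import math

def mse_sample_mean(theta, n):
    """MSE of T1 = sample mean under N(theta, 1); it does not depend on theta."""
    return 1.0 / n

def mse_constant(theta, theta0):
    """MSE of the constant estimator T2 = theta0."""
    return (theta - theta0) ** 2

n, theta0 = 25, 1.0
half_width = 1.0 / math.sqrt(n)        # T2 wins only inside theta0 +/- this
for theta in (0.5, 0.9, 1.0, 1.1, 1.5):
    better = "T2" if mse_constant(theta, theta0) < mse_sample_mean(theta, n) else "T1"
    print(theta, mse_sample_mean(theta, n), mse_constant(theta, theta0), better)
```

Neither estimator has uniformly smaller MSE, which is the point of the example.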

Going back to Boscovitch's assumption that the sum of residuals is zero and the consequent assumption on the error distributions discussed earlier, we could impose the condition that the distribution of $T$ should be symmetric around $\psi(\theta)$ and the centre of the distribution of $T$ given by its first moment should coincide with $\psi(\theta)$, the parametric function that we want to estimate. Now if $E_\theta(T)$ exists and the distribution of $T$ is symmetric around $\psi(\theta)$ we have $E_\theta(T) = \psi(\theta)$, but $E_\theta(T) = \psi(\theta)$ does not necessarily imply that the distribution of $T$ is symmetric around $\psi(\theta)$. We therefore impose the weaker condition namely

$$E_\theta(T) = \psi(\theta) \quad \forall\, \theta \in \Omega \qquad (3.1.1)$$

or $\int t\, g(t, \theta)\, dt = \psi(\theta)$ $\forall\, \theta \in \Omega$.

If $T$ satisfies the above condition then we define $T$ to be an unbiased estimator of $\psi(\theta)$. Usually there are several unbiased estimators of $\psi(\theta)$ and we denote by $U_{\psi(\theta)} = \{T \mid E_\theta(T) = \psi(\theta), \forall\, \theta \in \Omega\}$ the class of all unbiased estimators of $\psi(\theta)$.


Example 3.1.2 Let $(X_1, \ldots, X_n)$ be i.i.d. $N(\theta, 1)$; then $\bar{X} \sim N(\theta, 1/n)$ and therefore $E_\theta(\bar{X}) = \theta$ $\forall\, \theta \in R_1$ and $\bar{X} \in U_\theta$. Similarly if $T_a = \sum a_i X_i / \sum a_i$ then $T_a \sim N(\theta, \sum a_i^2 / (\sum a_i)^2)$ and $T_a \in U_\theta$ for any $a \in R_n - \{0\}$ with $\sum a_i \neq 0$. Since the normal pdf is symmetric about its mean, $T_a \in U_\theta$ has its pdf symmetric about $\theta$.

On the other hand consider $(X_1, \ldots, X_n)$ i.i.d. exponential with mean $\theta$ having pdf $f(x, \theta) = \frac{1}{\theta} e^{-x/\theta}$, $x > 0$, $\theta > 0$. Let $T_1 = X_1$; then $T_1 \in U_\theta$ and the pdf of $X_1$ is the same as $f(x, \theta)$ given above, which is certainly not symmetric about $\theta$. Further if $T_2 = \bar{X}$ then $T_2 \in U_\theta$ but its pdf, given by

$$g(t_2, \theta) = \frac{n^n\, t_2^{\,n-1}}{\Gamma(n)\, \theta^n}\, e^{-n t_2/\theta}, \quad t_2 > 0,\ \theta > 0,$$

is not symmetric about $\theta$ for moderate values of $n$. Note that due to the central limit theorem $T_2$ is approximately normal with mean $\theta$ and variance $\theta^2/n$ and therefore for large $n$ the pdf $g(t_2, \theta)$ is nearly symmetric about $\theta$. The reader should verify that in the case of $(X_1, \ldots, X_n)$ i.i.d. Poisson with mean $\theta$, a similar phenomenon occurs.


We say that a parametric function $\psi(\theta)$ is estimable if $U_{\psi(\theta)}$ is not empty, or there exists at least one unbiased estimator of $\psi(\theta)$. Note that if $U_{\psi(\theta)}$ contains two distinct elements $T_1$ and $T_2$ then it contains infinitely many, as any convex combination of $T_1$, $T_2$ would also belong to $U_{\psi(\theta)}$. In general whether a function $\psi(\theta)$ is estimable or not is connected with the problem of solving the integral equation

$$\int T(x)\, L(x, \theta)\, dx = \psi(\theta), \quad \forall\, \theta \in \Omega \qquad (3.1.2)$$

where $L(x, \theta)$ and $\psi(\theta)$ are known and $T(x)$ is to be obtained. When $L(x, \theta)$ is the pmf of a discrete r.v. (3.1.2) reduces to the equation

$$\sum_x \varphi(x)\, L(x, \theta) = \psi(\theta) \qquad (3.1.3)$$

and particularly in the case where $\{f(x, \theta), \theta \in \Omega\}$ is a power series distribution we can use the technique of equating coefficients of $\theta^r$ when $\psi(\theta)$ is an analytic function. We illustrate this by an example.

Example 3.1.3 Let $(X_1, X_2)$ be i.i.d. Bernoulli with $P(X_i = 1) = \theta$ and $P(X_i = 0) = 1 - \theta$. Then $L(x, \theta) = \theta^{x_1 + x_2} (1 - \theta)^{2 - x_1 - x_2}$. Consider $\psi(\theta) = \theta(1 - \theta) = P(X_1 = 1 \text{ and } X_2 = 0)$. Then by using the indicator function, define $T(x_1, x_2) = 1$ if $x_1 = 1$, $x_2 = 0$ and zero otherwise. Then $E(T) = \theta(1 - \theta)$, $\forall\, \theta \in \Omega = (0, 1)$, and $\psi(\theta)$ is estimable. On the other hand let $\varphi(x_1, x_2)$ be an unbiased estimator of $\theta(1 - \theta)$; then

$$\varphi(0, 0)(1 - \theta)^2 + [\varphi(0, 1) + \varphi(1, 0)]\,\theta(1 - \theta) + \varphi(1, 1)\,\theta^2 = \theta(1 - \theta) \qquad (3.1.4)$$

Since (3.1.4) holds for each $\theta \in (0, 1)$, coefficients of $\theta^r$, $r = 0, 1, 2$ must be identical. Since $\theta^0$ has zero coefficient on RHS we must have $\varphi(0, 0) = 0$. Equating coefficients of $\theta$ and $\theta^2$ we have

$$\varphi(0, 1) + \varphi(1, 0) = 1$$

and

$$\varphi(1, 1) - [\varphi(0, 1) + \varphi(1, 0)] = -1.$$

This gives $\varphi(1, 1) = 0$ and $\varphi(0, 1) + \varphi(1, 0) = 1$. The estimator $T(x_1, x_2)$ given above is a particular case of the general solution (corresponding to $a = 0$) given by

$$\varphi(0, 0) = 0 = \varphi(1, 1) \ \text{ and } \ \varphi(0, 1) = a,\ \varphi(1, 0) = 1 - a, \ \text{ where } a \in R_1 \qquad (3.1.5)$$

Indeed (3.1.5) defines the entire class $U_{\theta(1-\theta)}$ of unbiased estimators of $\theta(1 - \theta)$.

Suppose we now take $\psi(\theta) = \theta^3$; then RHS of (3.1.4) is $\theta^3$. Since the coefficient of $\theta^3$ on RHS is unity and that on LHS is zero it follows that no matter how we define $\varphi(x_1, x_2)$, we cannot have

$$\varphi(0, 0)(1 - \theta)^2 + [\varphi(0, 1) + \varphi(1, 0)]\,\theta(1 - \theta) + \varphi(1, 1)\,\theta^2 = \theta^3, \quad \forall\, \theta \in (0, 1).$$

Therefore $\psi(\theta) = \theta^3$ is not estimable. However if instead of a sample of size two we have a sample of size three then we can show that $\theta^3$ is estimable and $U_{\theta^3}$ is given by

$$\varphi(0, 0, 0) = 0, \quad \varphi(1, 0, 0) + \varphi(0, 1, 0) + \varphi(0, 0, 1) = 0,$$
$$\varphi(1, 1, 0) + \varphi(1, 0, 1) + \varphi(0, 1, 1) = 0 \ \text{ and } \ \varphi(1, 1, 1) = 1.$$

In the case of a continuous r.v. $X$, the problem is more difficult and one has to use the technique of Laplace Transforms among others to obtain the solution, and we will not go into further details.
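The unbiasedness of every member of the family (3.1.5), and of the size-three estimator of $\theta^3$, can be confirmed by brute-force enumeration over all Bernoulli outcomes. This is a small sketch using exact rational arithmetic; the function names and the grid of $\theta$ values are illustrative assumptions, and checking a finite grid is a sanity check rather than a proof.

```python
from fractions import Fraction
from itertools import product

def expectation(phi, theta, n):
    """E[phi(X_1, ..., X_n)] for i.i.d. Bernoulli(theta), by enumeration."""
    total = Fraction(0)
    for x in product((0, 1), repeat=n):
        s = sum(x)
        total += phi(x) * theta ** s * (1 - theta) ** (n - s)
    return total

def phi_family(a):
    """The estimator of theta*(1 - theta) from (3.1.5) with free constant a."""
    table = {(0, 0): Fraction(0), (1, 1): Fraction(0),
             (0, 1): a, (1, 0): 1 - a}
    return lambda x: table[x]

def phi_cube(x):
    """One unbiased estimator of theta^3 based on a sample of size three."""
    return Fraction(1) if x == (1, 1, 1) else Fraction(0)

thetas = [Fraction(k, 10) for k in range(1, 10)]
family_ok = all(
    expectation(phi_family(Fraction(a)), th, 2) == th * (1 - th)
    for a in (-2, 0, Fraction(1, 3), 5) for th in thetas)
cube_ok = all(expectation(phi_cube, th, 3) == th ** 3 for th in thetas)
print(family_ok, cube_ok)
```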


Exercise 3.1.1 (1) Consider a Poisson distribution with the zero class truncated, that is, a discrete r.v. $X$ with pmf $f(x, \lambda) = \frac{\lambda^x}{(e^\lambda - 1)\, x!}$, $x = 1, 2, \ldots$. This model arises in a Poisson situation in which the event $X = 0$ is not observable. Show that the parameter $\lambda$ is estimable by solving the equation $\sum_{x=1}^{\infty} \frac{\lambda^x}{x!}\, T(x) = \lambda(e^\lambda - 1)$. Expand RHS in a power series and equate coefficients to obtain $T(1) = 0$, $T(x) = x$, $x = 2, 3, \ldots$.

(2) In the above case show that $\frac{1}{\lambda}$ is not estimable.

(3) Let $(X_1, \ldots, X_n)$ be i.i.d. exponential with mean $\theta$ and let $T = \sum X_i$ be the minimal sufficient statistic for $\theta$ with pdf given by $g(t, \theta) = \frac{t^{n-1} e^{-t/\theta}}{\Gamma(n)\, \theta^n}$, $t > 0$, $\theta > 0$. Show that for any $r > 0$, $\theta^r$ is estimable with $u_r(t) = \frac{\Gamma(n)\, t^r}{\Gamma(n + r)}$ as an unbiased estimator of $\theta^r$. Let $P_k(\theta) = \sum_{r=0}^{k} a_r \theta^r$ be a polynomial of degree $k$ in $\theta$. Then $\sum_{r=0}^{k} a_r u_r(t) \in U_{P_k(\theta)}$.

(4) In the above example show that $\frac{n - 1}{T} \in U_{1/\theta}$ if $n > 1$. (For $n = 1$, using Laplace transform theory one can show that $\frac{1}{\theta}$ is not estimable, i.e. $U_{1/\theta}$ is empty.)

We point out that Gauss had used a different definition of unbiasedness in the problem of direct or indirect measurements on the "magnitudes of interest". Thus in the simplest linear model $X_i = \theta + \varepsilon_i$, $i = 1, 2, \ldots, n$, Gauss defined an estimator $T(x_1, \ldots, x_n)$ to be unbiased for $\theta$ if $T(\theta, \ldots, \theta) = \theta$. Thus if we happen to observe $\theta$ without any error, i.e. $\varepsilon_i = 0$, $i = 1, 2, \ldots, n$, then the estimator $T$ must coincide with $\theta$. When the $\varepsilon_i$'s are continuous, $P(\varepsilon_i = 0) = 0$, $i = 1, 2, \ldots, n$. However if the $\varepsilon_i$ are discrete then we can have an estimator $T$ which is Gauss unbiased but not unbiased as defined by the condition $E(T) = \theta$, $\forall\, \theta \in \Omega$.

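For example if the error distribution is concentrated on $\varepsilon_i = -1$, $\varepsilon_i = +1$ and $\varepsilon_i = 0$ with unequal probabilities for $-1$ and $+1$, so that $E(\varepsilon_i) \neq 0$, then $\bar{X}$ is Gauss unbiased but $\bar{X} \notin U_\theta$.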


3.2 Best Linear Unbiased Estimator (BLUE) 

In the context of the repeated direct measurement model, where $X_i = \theta + \varepsilon_i$, $i = 1, 2, \ldots, n$, Gauss proved an interesting and important result that within the class of linear unbiased estimators, $\bar{X}$ is BLUE. The result holds under the assumption that $\varepsilon_i$, $i = 1, 2, \ldots, n$, are independent with $E(\varepsilon_i) = 0$ and $\text{Var}(\varepsilon_i) = \sigma^2$, which implies that $E(X_i) = \theta$ and $\text{Var}(X_i) = \sigma^2$. Consider the sub-class of $U_\theta$ namely linear unbiased estimators, $U_\theta(a) = \{\sum a_i X_i \mid \sum a_i = 1\}$. Then $\bar{X}$ is "Best" within $U_\theta(a)$ in the sense that for any $T(a) \in U_\theta(a)$ we have $\text{Var}(\bar{X}) = \frac{\sigma^2}{n} \leq \text{Var}(T(a)) = \sigma^2 \sum a_i^2$. The result is easy to prove. The restriction $\sum a_i = 1$ follows from the requirement that $E(\sum a_i X_i) = \theta \sum a_i = \theta$, $\forall\, \theta \in \Omega$, which implies that $\sum a_i = 1$. Now $\text{Var}(T(a)) = \sigma^2 \sum a_i^2$ and is to be minimized subject to the linear restriction that $\sum a_i = 1$. One may use the Cauchy-Schwarz inequality $(\sum a_i)^2 \leq n \sum a_i^2$ or use Lagrange's method of undetermined multipliers to show that the minimum of $\sum a_i^2$ subject to the constraint $\sum a_i = 1$ occurs at $a_1 = a_2 = \cdots = a_n = \frac{1}{n}$.

We point out the fact that the above result of Gauss, also known as the Gauss-Markov theorem, does not assume a fully parametric model for the error distribution, or equivalently no specific distributional form for the distribution of $X_i$ is assumed. The only assumptions made are $E(X_i) = \theta$ and $\text{Var}(X_i) = \sigma^2$ and that the $X_i$'s are uncorrelated, but not necessarily mutually independent and even not necessarily i.i.d. as they can differ in higher order moments. Observe that if we use the definition of unbiasedness as originally proposed by Gauss then $T(a) = \sum a_i X_i$ is unbiased if $\sum a_i \theta = \theta$, $\forall\, \theta \in \Omega$, or $\sum a_i = 1$. In this case also we minimize $\sum a_i^2$ subject to $\sum a_i = 1$ and $\bar{X}$ is BLUE of $\theta$.
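The minimization of $\sum a_i^2$ subject to $\sum a_i = 1$ can be illustrated numerically by comparing the equal weights $1/n$ with many randomly generated weight vectors that satisfy the constraint. This is only a sketch under arbitrary choices ($n = 5$, 1000 random trials, a fixed seed); it illustrates the inequality and does not replace the Cauchy-Schwarz argument.

```python
import random

def variance_factor(weights):
    """Var(sum a_i X_i) / sigma^2 for uncorrelated X_i with common variance."""
    return sum(a * a for a in weights)

def random_unbiased_weights(n, rng):
    """Random weights shifted so that they sum to 1 (the unbiasedness constraint)."""
    raw = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    shift = (1.0 - sum(raw)) / n
    return [a + shift for a in raw]

rng = random.Random(2024)              # arbitrary seed
n = 5
best = variance_factor([1.0 / n] * n)  # equals 1/n for the sample mean
trials = [variance_factor(random_unbiased_weights(n, rng)) for _ in range(1000)]
print(best, min(trials))
```

Every random weight vector gives a variance factor at least as large as $1/n$, as the theorem asserts.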

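Example 3.2.1 Let $X_i = \theta + \varepsilon_i$ where the $\varepsilon_i$ are i.i.d. with zero expectation and finite second order moments. These assumptions allow various distributions such as Normal, Laplace, Uniform, and Euler.

(1) $f(\varepsilon) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\{-\varepsilon^2/2\sigma^2\}$, $\varepsilon \in R_1$, $E(\varepsilon) = 0$, $\text{Var}(\varepsilon) = \sigma^2$

(2) $f(\varepsilon) = \frac{1}{2\sigma} \exp\{-|\varepsilon|/\sigma\}$, $\varepsilon \in R_1$, $E(\varepsilon) = 0$, $\text{Var}(\varepsilon) = 2\sigma^2$

(3) $f(\varepsilon) = \frac{1}{2\sigma}$, $-\sigma < \varepsilon < \sigma$, $E(\varepsilon) = 0$, $\text{Var}(\varepsilon) = \sigma^2/3$

(4) $f(\varepsilon) = \frac{3}{4\sigma^3}(\sigma^2 - \varepsilon^2)$, $-\sigma < \varepsilon < \sigma$, $E(\varepsilon) = 0$, $\text{Var}(\varepsilon) = \sigma^2/5$

For all the above distributions we have that $\bar{X}$ is BLUE of $\theta$. On the other hand the result is not applicable if the errors are Cauchy distributed with pdf

$$f(\varepsilon) = \frac{1}{\pi}\, \frac{1}{1 + \varepsilon^2}, \quad \varepsilon \in R_1.$$

Example 3.2.2 The result that $\bar{X}$ is BLUE of $\theta$ holds when $(X_1, \ldots, X_n)$ are i.i.d. Poisson with mean $\theta$, or exponential with mean $\theta$. As seen earlier the distribution of $\bar{X}$ is not symmetric around $\theta$ and the observations are not arising out of the repeated measurements model $X_i = \theta + \varepsilon_i$.

The restriction to the class of linear unbiased estimators may create some problems in that such a sub-class of $U_\theta$ may be empty but $U_\theta$ itself may not be empty. We consider two examples to illustrate this point.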


Example 3.2.3 Let $(X_1, \ldots, X_n)$ be i.i.d. with pdf $f(x, \theta) = \exp\{-(x - \theta)\}$, $x \geq \theta$ and $\theta \in R_1$. Then $E(X - \theta) = 1$ and $E(X) = \theta + 1$ or $E(X - 1) = \theta$. Hence $U_\theta$ is not empty. However if we restrict to the class of linear estimators $T(a) = \sum a_i X_i$ then $E(\sum a_i X_i) = \sum a_i(\theta + 1) = \theta \sum a_i + \sum a_i$. Then $T(a) \in U_\theta$ iff $\theta \sum a_i + \sum a_i = \theta$, $\forall\, \theta \in R_1$, or iff $\sum a_i = 1$ and $\sum a_i = 0$, which is impossible. This proves the result that the class of linear unbiased estimators is empty. Of course we can generalize the class of linear estimators by defining it to consist of $T_1 = a_0 + \sum_{i=1}^{n} a_i X_i$, $(a_0, a_1, \ldots, a_n)' \in R_{n+1}$, i.e. allowing non-homogeneous linear estimators. In this problem one can show that $(\bar{X} - 1)$ is BLUE of $\theta$ in the larger class of non-homogeneous linear unbiased estimators of $\theta$.


Example 3.2.4 Let $(X_1, \ldots, X_n)$ be i.i.d. with pdf $f(x, \theta) = \theta x^{\theta - 1}$, $0 < x < 1$, and $0 < \theta < 1$. Then $E(X) = \frac{\theta}{\theta + 1}$ and $E(T(a_0, \ldots, a_n)) = a_0 + \frac{\theta}{\theta + 1} \sum a_i$, and $T(a_0, \ldots, a_n) \in U_\theta$ iff $a_0(\theta + 1) + \theta \sum a_i = \theta(\theta + 1)$ for every $\theta \in (0, 1)$. On comparing coefficients of $\theta^2$ this leads to the contradiction $1 = 0$ and thus the class of non-homogeneous (and therefore homogeneous) linear unbiased estimators is empty. On the other hand $U_\theta$ is not empty if $n \geq 2$. This follows from the fact that $Y_i = -\log X_i$ are i.i.d. exponential with mean $\frac{1}{\theta}$ and therefore $T = \sum Y_i$ is Gamma $(n, \theta)$ and $E\big(\frac{n - 1}{T}\big) = \theta$. Note that $T$ is also minimal sufficient as $\{f(x, \theta), \theta \in (0, 1)\}$ is a one parameter exponential family. Note that $\frac{n - 1}{T}$ is not a linear estimator even in $(Y_1, \ldots, Y_n) = (-\log X_1, \ldots, -\log X_n)$.

Another difficulty with the BLUE is the fact that whereas $\bar{X}$ may be BLUE of $\theta$ there may exist non-linear estimators in $U_\theta$ whose variance could be smaller than that of $\bar{X}$. Again we illustrate this situation by an example.


Example 3.2.5 Let $(X_1, X_2, X_3)$ be i.i.d. $U(\theta - 1, \theta + 1)$; then $f(x, \theta) = \frac{1}{2}$ for $\theta - 1 < x < \theta + 1$ and $E(X) = \theta$ and $\text{Var}(X) = \frac{1}{3}$. Therefore $\bar{X}$ is BLUE of $\theta$ with $\text{Var}(\bar{X}) = \frac{1}{9}$. Consider the order statistic $(X_{(1)}, X_{(2)}, X_{(3)})$. We consider $T_1 = (X_{(1)} + X_{(3)})/2$ and $T_2 = X_{(2)}$ as estimators of $\theta$. Let $Y = \frac{X - \theta + 1}{2}$; then $Y \sim U(0, 1)$ and let $(Y_{(1)}, Y_{(2)}, Y_{(3)})$ correspond to $(X_{(1)}, X_{(2)}, X_{(3)})$. Then it is easy to show that $E(Y_{(1)}) = \frac{1}{4}$, $E(Y_{(2)}) = \frac{1}{2}$, $E(Y_{(3)}) = \frac{3}{4}$. Therefore both $T_1 = \frac{1}{2}(X_{(1)} + X_{(3)})$ and $T_2 = X_{(2)}$ are unbiased for $\theta$. Now we can show that $\text{Var}(Y_{(1)}) = \frac{3}{80} = \text{Var}(Y_{(3)})$ and $\text{Cov}(Y_{(1)}, Y_{(3)}) = \frac{1}{80}$. Using the relation $X_{(i)} = 2Y_{(i)} + \theta - 1$, we obtain $\text{Var}\big(\frac{X_{(1)} + X_{(3)}}{2}\big) = \frac{1}{10}$, which is smaller than that of the BLUE $\bar{X}$, which is $\frac{1}{9}$. Note that $\text{Var}(X_{(2)}) = 4\, \text{Var}(Y_{(2)}) = \frac{1}{5}$ and although $X_{(2)} \in U_\theta$, $\text{Var}(X_{(2)}) > \text{Var}(\bar{X}) > \text{Var}\big(\frac{X_{(1)} + X_{(3)}}{2}\big)$.
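The three variances in this example can be recomputed exactly. The sketch below assumes the standard moment formulas for order statistics of a $U(0, 1)$ sample of size $n$, namely $E(Y_{(k)}) = k/(n+1)$ and $\text{Cov}(Y_{(i)}, Y_{(j)}) = i(n - j + 1)/((n+1)^2(n+2))$ for $i \leq j$; these formulas are not derived in the text and the function names are illustrative.

```python
from fractions import Fraction

def order_stat_mean(k, n):
    """E(Y_(k)) for a U(0, 1) sample of size n."""
    return Fraction(k, n + 1)

def order_stat_cov(i, j, n):
    """Cov(Y_(i), Y_(j)) for i <= j, U(0, 1) sample of size n."""
    return Fraction(i * (n - j + 1), (n + 1) ** 2 * (n + 2))

n = 3
scale_sq = 4                           # X_(i) = 2 Y_(i) + theta - 1

var_mean = Fraction(1, 3) / n          # Var(X) / n with Var(X) = 2^2 / 12
var_midrange = scale_sq * (order_stat_cov(1, 1, n) + order_stat_cov(3, 3, n)
                           + 2 * order_stat_cov(1, 3, n)) / 4
var_median = scale_sq * order_stat_cov(2, 2, n)

print(var_mean, var_midrange, var_median)
```

The ordering of the three exact values reproduces the chain of inequalities stated at the end of the example.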

The approach based on BLUE has its roots in linear models where it is natural to restrict attention to linear unbiased estimators of parameters of interest. Particularly when the errors are assumed to be normally distributed, the method of least squares leads to the BLUE which also happens to be the Minimum Variance Unbiased (MVU) estimator. If the errors are not normal then of course this result is not true, as the above examples show. In the next section we consider the problem of MVU estimation of $\theta$ (or of $\psi(\theta)$) using the approach based on the Cramer-Rao Lower Bound (CRLB) to the variance of an unbiased estimator.

3.3 Cramer-Rao Inequality and Its Applications

Let $X$ be a random vector $(X_1, \ldots, X_n)'$ with the joint pdf of $X$ belonging to the class $\{f(x, \theta), \theta \in \Omega \subset R_1\}$. Assume that this class satisfies the regularity conditions given in Sec. 2.2 so that the Fisher Information $I_X(\theta)$ is well defined. Recall that these conditions are

(1) $S_\theta = \{x \mid f(x, \theta) > 0\}$ does not depend on $\theta$, or $S_\theta = S$.

(2) The identity $\int f(x, \theta)\, dx = 1$ $\forall\, \theta \in \Omega$ can be differentiated under the integral sign twice.

Consider $U_\psi = \{T(x) \mid E(T(x)) = \psi(\theta), \forall\, \theta \in \Omega\}$ which we assume to be non-empty. We will now establish the Cramer-Rao Lower Bound (CRLB) for $\text{Var}(T)$ under the following additional regularity condition:

(3) $T \in U_\psi$ is such that differentiation under the integral sign is valid at least once, or for every $\theta \in \Omega$ we have

$$\frac{\partial}{\partial\theta} \int T(x)\, f(x, \theta)\, dx = \int T(x)\, \frac{\partial \log f(x, \theta)}{\partial\theta}\, f(x, \theta)\, dx = \frac{d\psi}{d\theta}.$$

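Theorem 3.3.1 Under the above regularity conditions, we have

$$\text{Var}(T) \geq \frac{\left(\dfrac{d\psi}{d\theta}\right)^2}{I_X(\theta)} \qquad (3.3.1)$$

Noting that $E\left[\dfrac{\partial \log f(x, \theta)}{\partial\theta}\right] = 0$, the condition (3) implies that

$$\text{Cov}\left[T, \frac{\partial \log f(x, \theta)}{\partial\theta}\right] = \frac{d\psi}{d\theta}.$$

Now using the fact that $(\text{Cov})^2 \leq$ product of variances, we have

$$\left(\frac{d\psi}{d\theta}\right)^2 \leq \text{Var}(T)\, I_X(\theta),$$

since $\text{Var}\left[\dfrac{\partial \log f(x, \theta)}{\partial\theta}\right] = I_X(\theta)$. We have therefore

$$\text{Var}(T) \geq \frac{\left(\dfrac{d\psi}{d\theta}\right)^2}{I_X(\theta)} \qquad (3.3.2)$$

The RHS of (3.3.2) is called the CRLB for $\text{Var}(T)$.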

We have already seen in Section 2.2 on Fisher Information that verifying the condition (2) above is not always easy. In a similar way verifying the condition (3) of differentiation under the integral sign in $E_\theta(T) = \psi(\theta)$ is not always easy. Further the CRLB of variance of an unbiased estimator $T \in U_\psi$ would be a valid lower bound for the class $U_\psi$, as opposed to an individual element $T \in U_\psi$, provided the condition (3) holds for every $T \in U_\psi$. Thus we assume that the condition (3) holds for every $T \in U_\psi$ as condition (4).

Theorem 3.3.2 If conditions (1) through (4) hold then for any $T \in U_\psi$, $\text{Var}(T) \geq \left(\dfrac{d\psi}{d\theta}\right)^2 \Big/ I_X(\theta)$, and if there exists a $T^* \in U_\psi$ for which the equality holds, then $T^*$ is MVUE of $\psi(\theta)$.

The above theorem presents a complete solution to the problem of MVUE in case the model is such that the regularity conditions (1) through (4) are satisfied and existence of $T^*$ is assured.


Example 3.3.1 As an example we consider the case of a random sample of size $n$ from $b(1, \theta)$, or $(X_1, \ldots, X_n)$ are i.i.d. with $P(X_i = 1) = \theta$, $P(X_i = 0) = 1 - \theta$, $0 < \theta < 1$. Then the joint pmf of $X$ is $f(x, \theta) = \theta^{\sum x_i} (1 - \theta)^{n - \sum x_i}$, $x_i = 0, 1$, $i = 1, 2, \ldots, n$, $0 < \theta < 1$. Here $S$ is the set of all $2^n$ sequences of zeroes and ones and the condition (1) holds. Further the identity $\sum_{x \in S} T(x)\, \theta^{\sum x_i} (1 - \theta)^{n - \sum x_i} = \theta$ has $2^n$ terms on LHS. Therefore, differentiation under the summation sign is valid and the conditions (1) through (4) hold. Therefore for any $T \in U_\theta$, $\text{Var}(T) \geq \frac{\theta(1 - \theta)}{n}$ since $\frac{d\psi}{d\theta} = 1$ and $I_X(\theta) = \frac{n}{\theta(1 - \theta)}$. Now in view of the fact that $(X_1, \ldots, X_n)$ are i.i.d. with $E(X_i) = \theta$ and $\text{Var}(X_i) = \theta(1 - \theta)$, $\bar{X}$ is BLUE of $\theta$ with $\text{Var}(\bar{X}) = \frac{\theta(1 - \theta)}{n}$ and thus $\bar{X}$ is MVUE. In this case $S$ consisted of a finite number of elements and the conditions (3) and (4) were easy to verify, as the derivative of a sum of finitely many terms is the sum of their derivatives. However in the continuous case such as $N(\theta, 1)$ or the Poisson case wherein an infinite series is involved, the verification of conditions (3) and (4) is not easy. We will however show in the following example that for a sample of size $n$ in $N(\theta, 1)$ the conditions (1) through (4) hold and $\bar{X}$ is in fact MVUE of $\theta$.
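The equality between the CRLB and $\text{Var}(\bar{X})$ in the Bernoulli case can be checked with exact arithmetic by computing the Fisher information of a single observation directly from its definition as the expected squared score. The following sketch uses illustrative function names and arbitrarily chosen values of $\theta$ and $n$.

```python
from fractions import Fraction

def fisher_information_single(theta):
    """E[(d/dtheta log f(X, theta))^2] for one Bernoulli(theta) observation."""
    total = Fraction(0)
    for x in (0, 1):
        prob = theta if x == 1 else 1 - theta
        score = Fraction(x) / theta - Fraction(1 - x) / (1 - theta)
        total += prob * score ** 2
    return total

def crlb_for_theta(theta, n):
    """CRLB for unbiased estimators of theta from n i.i.d. observations."""
    return 1 / (n * fisher_information_single(theta))

def variance_of_sample_mean(theta, n):
    return theta * (1 - theta) / n

n = 7
for theta in (Fraction(1, 4), Fraction(1, 2), Fraction(9, 10)):
    print(theta, crlb_for_theta(theta, n), variance_of_sample_mean(theta, n))
```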




Example 3.3.2 Let $(X_1, \ldots, X_n)$ be i.i.d. $N(\theta, 1)$; then using the results of Section 2.2 we claim that conditions (1) and (2) hold and $I_X(\theta) = n$. Now consider any $T(x) \in U_\theta$. Then using the fact that $\sum (x_i - \theta)^2 = \sum (x_i - \bar{x})^2 + n(\bar{x} - \theta)^2$ and writing

$$|A_h| = \left| T(x)\, \frac{L(x, \theta_0 + h) - L(x, \theta_0)}{h} \right|,$$

using arguments similar to Example 2.2.2 we have for $0 < |h| \leq \delta$


1/4, I < I 7"(jc) I exp 


f-X(x,-x ) 2 




-n(x - 0 0 ) 


+ n I x - 0a I 


= G 0 (x) 


j G 0 (x)dx< j-jf I T(x) I L(x, 9 0 + S) dx + J I T(x) I L(x, 0 O - 6) dx> 

(3.3.3) 

Since $E(T(x)) = \theta$ $\forall\, \theta \in \Omega$, the integrals on RHS of (3.3.3) exist and therefore $G_0(x)$ is integrable and the conditions (3) and (4) hold. Therefore for any $T \in U_\theta$, $\text{Var}(T) \geq \frac{1}{n}$. The CRLB of $\frac{1}{n}$ is attained by $\text{Var}(\bar{X})$ and thus $\bar{X}$ is not only BLUE but is also MVUE.

We recommend to an enterprising reader to work out similar results for a sample of size $n$ from Poisson with mean $\theta$ and also for the exponential distribution with mean $\theta$ to show that the CRLB for estimating $\theta$ is attained by $\bar{X}$, and $\bar{X}$ is not only BLUE but is also MVUE of $\theta$. Indeed one can show that for a random sample of size $n$ from a distribution belonging to a one parameter exponential family, the CRLB result holds, but we will not pursue the problem.

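Suppose that the model $\{L(x, \theta), \theta \in \Omega\}$ and $T \in U_\psi$ are such that the CRLB result holds. Then we call $T^*$ a Minimum Variance Bound Unbiased (MVBU) estimator of $\psi(\theta)$ provided

$$\text{Var}(T^*) = \frac{\left(\dfrac{d\psi}{d\theta}\right)^2}{I_X(\theta)} \qquad (3.3.4)$$

Now $\text{Var}(T^*)$ attains the CRLB iff $\left[\text{Cov}\left(T^*, \dfrac{\partial \log L}{\partial\theta}\right)\right]^2 = \text{Var}(T^*)\, I_X(\theta)$, or the correlation coefficient between $T^*$ and $\dfrac{\partial \log L}{\partial\theta}$ is $\pm 1$. Thus (3.3.4) holds iff

$$\frac{T^* - \psi(\theta)}{\sqrt{\text{Var}(T^*)}} = \pm \frac{1}{\sqrt{I_X(\theta)}}\, \frac{\partial \log L}{\partial\theta}.$$

Noting that $\sqrt{\text{Var}(T^*)} = \left|\dfrac{d\psi}{d\theta}\right| \Big/ \sqrt{I_X(\theta)}$ we must have

$$T^* = \psi(\theta) \pm \frac{d\psi/d\theta}{I_X(\theta)}\, \frac{\partial \log L}{\partial\theta} \qquad (3.3.5)$$

Observe that the RHS of (3.3.5) can be computed once the model $\{L(x, \theta), \theta \in \Omega\}$ and the function $\psi(\theta)$ are specified, and we can immediately check whether or not there exists a $T^*$ which is MVBUE for $\psi(\theta)$.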


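Example 3.3.3 Let $(X_1, \ldots, X_n)$ be i.i.d. Poisson $(\theta)$; then

$$L(x, \theta) = \frac{e^{-n\theta}\, \theta^{\sum x_i}}{\prod x_i!}, \qquad \frac{\partial \log L}{\partial\theta} = -n + \frac{\sum x_i}{\theta} = \frac{n(\bar{x} - \theta)}{\theta}.$$

Further $I_X(\theta) = \frac{n}{\theta}$. Suppose we want to estimate $\psi(\theta) = \theta$; then $\frac{d\psi}{d\theta} = 1$ and equation (3.3.5) becomes $T^* = \theta \pm \frac{\theta}{n} \cdot \frac{n(\bar{x} - \theta)}{\theta} = \theta \pm (\bar{x} - \theta)$. If we take the positive sign on RHS, $T^* = \bar{x}$ and therefore $\bar{X}$ attains the CRLB and is BLUE, MVBUE as well as MVUE of $\theta$. On the other hand suppose we want to estimate $\psi(\theta) = e^{-\theta} = P[X = 0]$; then the estimator defined by $T(x) = 1$ if $x_1 = 0$ and zero otherwise is unbiased for $\psi(\theta)$ and $U_\psi$ is not empty. Further $\frac{d\psi}{d\theta} = -e^{-\theta}$. Therefore equation (3.3.5) becomes $T^* = e^{-\theta} \pm e^{-\theta}(\bar{x} - \theta)$ and whether we choose the positive or negative sign on RHS, in both cases $T^*$ is a function of $\bar{x}$ and $\theta$ and it is not a statistic. Hence we conclude that there does not exist an MVBU estimator of $\psi(\theta) = e^{-\theta}$. The CRLB for $\psi(\theta) = e^{-\theta}$ however is well defined and is given by $\theta e^{-2\theta}/n$. The question whether there exists an MVUE in $U_\psi$ remains however open. Another way to write (3.3.5) is to observe that

$$\frac{\partial \log L}{\partial\theta} = A(\theta)\, [T^* - \psi(\theta)], \quad \text{where } A(\theta) = \pm \frac{I_X(\theta)}{d\psi/d\theta} \qquad (3.3.6)$$

Assuming that $A(\theta)$, $\psi(\theta)$ and $A(\theta)\psi(\theta)$ are integrable w.r.t. $\theta$, integrating (3.3.6) w.r.t. $\theta$, we get

$$\log L = u(\theta)\, T^*(x) + v(\theta) + w(x)$$

where $u(\theta) = \int A(\theta)\, d\theta$ and $v(\theta) = -\int A(\theta)\, \psi(\theta)\, d\theta$.

Thus $\{L(x, \theta), \theta \in \Omega \subset R_1\}$ is a one-parameter exponential family with $T^*(x)$ as minimal sufficient statistic such that $E(T^*(x)) = -\frac{v'(\theta)}{u'(\theta)} = \psi(\theta)$.

Conversely if $\{L(x, \theta), \theta \in \Omega\}$ is a one-parameter exponential family with pdf given by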



$$\log L(x, \theta) = u(\theta)\, T(x) + v(\theta) + w(x) \qquad (3.3.7)$$

then one can show that for $\psi(\theta) = E(T(x)) = -\frac{v'(\theta)}{u'(\theta)}$, $T(x)$ is MVBUE as well as MVUE of $\psi(\theta)$. We can show that conditions (1) and (2) hold and we have

$$\frac{\partial \log L}{\partial\theta} = u'(\theta)\, [T(x) - \psi(\theta)]$$

where $u'$ denotes the derivative of $u$ w.r.t. $\theta$, and

$$\frac{\partial^2 \log L}{\partial\theta^2} = u''(\theta)\, [T(x) - \psi(\theta)] - u'(\theta)\, \frac{d\psi}{d\theta}.$$

Hence $I_X(\theta) = E\left(-\dfrac{\partial^2 \log L}{\partial\theta^2}\right) = u'(\theta)\, \dfrac{d\psi}{d\theta}$ since $E(T(x)) = \psi(\theta)$. But $E\left(\dfrac{\partial \log L}{\partial\theta}\right)^2 = I_X(\theta) = [u'(\theta)]^2\, \text{Var}(T(x))$, therefore

$$\text{Var}(T(x)) = \frac{I_X(\theta)}{[u'(\theta)]^2} = \frac{\left(\dfrac{d\psi}{d\theta}\right)^2}{I_X(\theta)}.$$

A detailed proof of the above result, particularly the proof that for $T_1 \in U_\psi$ conditions (3) and (4) hold, will not be attempted here. An enterprising reader can attempt this by following the techniques of Examples 2.2.2 and 3.3.2 which show that for the $N(\theta, 1)$ family conditions (1) through (4) hold. The techniques illustrated in the above examples are to be applied after making a one-one parametric transformation $\varphi = u(\theta)$ and writing (3.3.7) in the form

$$\log L_1(x, \varphi) = \varphi\, T(x) + v_1(\varphi) + w(x).$$

It now follows that even in a one-parameter exponential family the only parametric function which admits an MVU estimator whose variance attains the CRLB is the function $\psi(\theta) = E(T(x))$, where $T$ is the minimal sufficient statistic. As mentioned in Example 3.3.3, the question still remains open as to whether a function $\eta(\theta)$ admits an MVU estimator which is not MVBUE, i.e. whose variance does not attain the CRLB. In the next two sections we will present a partial solution to the problem of MVU estimation using the Rao-Blackwell-Lehmann-Scheffe approach.

3.4 Rao-Blackwell Theorem 

This theorem was a notable advance towards the solution of the problem of MVU estimation of an estimable parametric function $\psi(\theta)$. Suppose that the family $\{L(x,\theta),\ \theta \in \Omega\}$ is such that it admits a sufficient statistic $T$ with corresponding pdf $\{g(t,\theta),\ \theta \in \Omega\}$. Let $T_1 \in U_\psi$; then first we observe that if we take the conditional expectation of $T_1$ given $T = t$, then $E(T_1 \mid T = t) = \varphi_1(t)$ does not depend on $\theta$ for each $t$ and therefore $E(T_1 \mid T) = \varphi_1(T)$ is a statistic. We then show that $\varphi_1(T) \in U_\psi$ and further $\mathrm{Var}(T_1) \ge \mathrm{Var}(\varphi_1(T))$ for every $\theta \in \Omega$. Thus $\varphi_1(T)$ is an improved estimator obtained from $T_1$ by taking the conditional expectation for fixed $T$, a process called Rao-Blackwellization of $T_1$ w.r.t. a sufficient statistic $T$. Now if $T$ is not minimal sufficient and $M$ is a minimal sufficient statistic, then we can apply the process of Rao-Blackwellization to $T_2 = \varphi_1(T)$ and obtain $E(T_2 \mid M) = \varphi_2(M)$ such that $\varphi_2(M) \in U_\psi$ and $\mathrm{Var}(\varphi_1(T)) \ge \mathrm{Var}(\varphi_2(M))$ for each $\theta \in \Omega$. Thus the theorem implies that the search for the MVU estimator of $\psi(\theta)$ can be restricted to those elements of $U_\psi$ which depend on $X$ only through the minimal sufficient statistic $M$, or that the MVU estimator of $\psi(\theta)$ must be a function of $M$. Now we state and prove the Rao-Blackwell Theorem.

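Theorem 3.4.1 Let $\{L(x,\theta),\ \theta \in \Omega\}$ be such that a statistic $M$ is minimal sufficient for $\theta$ and $U_\psi$ is nonempty. Let $T_1 \in U_\psi$ and $\varphi_1(M) = E(T_1 \mid M)$. Then

(a) $\varphi_1(M) \in U_\psi$ and (b) $\mathrm{Var}(\varphi_1(M)) \le \mathrm{Var}(T_1)$, $\forall\ \theta \in \Omega$.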

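Let $g(t_1, m, \theta)$ be the joint pdf of $T_1$ and $M$. Then as $M$ is sufficient we have $g(t_1, m, \theta) = g_0(m, \theta)\, g_1(t_1 \mid m)$ for each $\theta \in \Omega$. Now $\varphi_1(m) = E(T_1 \mid M = m) = \int t_1\, g_1(t_1 \mid m)\, dt_1$ and therefore

$$E(\varphi_1(M)) = \int \left[\int t_1\, g_1(t_1 \mid m)\, dt_1\right] g_0(m,\theta)\, dm = \iint t_1\, g(t_1, m, \theta)\, dt_1\, dm = E(T_1) = \psi(\theta)$$

and $\varphi_1(M) \in U_\psi$.

Now

$$\mathrm{Var}(T_1) = \iint (t_1 - \psi(\theta))^2\, g(t_1, m, \theta)\, dt_1\, dm = \iint [t_1 - \varphi_1(m) + \varphi_1(m) - \psi(\theta)]^2\, g_1(t_1 \mid m)\, g_0(m,\theta)\, dt_1\, dm.$$

Expanding the square in the integrand we have RHS $= I_1 + I_2 + I_3$. Consider each $I_r$, $r = 1, 2, 3$ separately.

$$I_1 = \iint (t_1 - \varphi_1(m))^2\, g_1(t_1 \mid m)\, g_0(m,\theta)\, dt_1\, dm = \int \left[\int (t_1 - \varphi_1(m))^2\, g_1(t_1 \mid m)\, dt_1\right] g_0(m,\theta)\, dm = E[\mathrm{Var}(T_1 \mid M = m)].$$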



Note that $\int (t_1 - \varphi_1(m))^2\, g_1(t_1 \mid m)\, dt_1 = \mathrm{Var}(T_1 \mid M = m)$ is the conditional variance of $T_1$ given $M = m$ since $E(T_1 \mid M = m) = \varphi_1(m)$.

Similarly, let

$$I_2 = \iint (\varphi_1(m) - \psi(\theta))^2\, g_1(t_1 \mid m)\, g_0(m,\theta)\, dt_1\, dm = \int [\varphi_1(m) - \psi(\theta)]^2\, g_0(m,\theta)\, dm = \mathrm{Var}[\varphi_1(M)].$$

Now consider

$$I_3 = 2\iint [t_1 - \varphi_1(m)]\,[\varphi_1(m) - \psi(\theta)]\, g_1(t_1 \mid m)\, g_0(m,\theta)\, dt_1\, dm = 2\int [\varphi_1(m) - \psi(\theta)] \left[\int (t_1 - \varphi_1(m))\, g_1(t_1 \mid m)\, dt_1\right] g_0(m,\theta)\, dm = 0.$$

Now $\int (t_1 - \varphi_1(m))\, g_1(t_1 \mid m)\, dt_1 = 0$ for each fixed $m$ follows from the fact that $E(T_1 \mid M = m) = \varphi_1(m)$ for each fixed $m$. Thus $\mathrm{Var}(T_1) = \mathrm{Var}(\varphi_1(M)) + E[\mathrm{Var}(T_1 \mid M = m)]$. As $\mathrm{Var}(T_1 \mid M = m) \ge 0$ we have $E[\mathrm{Var}(T_1 \mid M = m)] \ge 0$ and therefore

$$\mathrm{Var}(T_1) \ge \mathrm{Var}(\varphi_1(M)) \qquad (3.4.1)$$

Observe that equality holds in (3.4.1) iff $\mathrm{Var}(T_1 \mid M = m) = 0$, or the conditional distribution of $T_1$ given $M = m$ is concentrated at its conditional expectation $\varphi_1(m)$, i.e. $T_1$ is already a function of the minimal sufficient statistic $M$.


Example 3.4.1 Let $(X_1, \ldots, X_n)$ be i.i.d. $N(\theta, 1)$; then we know that $\bar X$ is minimal sufficient for $\{L(x,\theta),\ \theta \in R_1\}$. Let $\psi(\theta) = \theta$; then observing that $\bar X \sim N(\theta, 1/n)$ we have $\bar X \in U_\theta$ and it is itself a function of the minimal sufficient statistic. On the other hand suppose we take $T_1 = X_1$; then $T_1 \in U_\theta$. As the conditional distribution of $T_1$ given $\bar X = \bar x$ is $N(\bar x,\ 1 - 1/n)$ it follows that $E(T_1 \mid \bar X = \bar x) = \varphi(\bar x) = \bar x$ for each $\bar x$. Hence Rao-Blackwellization of $T_1 = X_1$ leads to $\bar X$. Note that if we had started with any $T_i = X_i$, $i = 1, 2, \ldots, n$, its Rao-Blackwellization would also lead to the same statistic $\varphi_i(\bar x) = \bar x$, $i = 1, 2, \ldots, n$. Now let $T = \sum l_i X_i$ where $\sum l_i = 1$ so that $T \in U_\theta$. Then $(T, \bar X)$ is BVN with mean $(\theta, \theta)'$ and covariance matrix

$$\begin{pmatrix} \sum l_i^2 & 1/n \\ 1/n & 1/n \end{pmatrix}$$

and the conditional distribution of $T$ given $\bar X = \bar x$ is $N(\bar x,\ \sum l_i^2 - 1/n)$. Therefore Rao-Blackwellization of $T$, any linear unbiased estimator of $\theta$, leads to $\bar X$ only. Note that we have already seen that $\bar X$ is BLUE as well as MVUE since its variance attains the CRLB. Thus one should expect that for any $T \in U_\theta$ Rao-Blackwellization with respect to the minimal sufficient statistic $\bar X$ should lead to $\bar X$ itself.
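The bivariate normal calculation in Example 3.4.1 can be checked with exact arithmetic. The following is a minimal sketch, not part of the original text: the function name `rao_blackwell_linear` and the particular weights are illustrative assumptions. It computes the regression slope of $T = \sum l_i X_i$ on $\bar X$ and the conditional variance $\sum l_i^2 - 1/n$ from the covariance matrix; a slope of one means $E(T \mid \bar X = \bar x) = \theta + (\bar x - \theta) = \bar x$.

```python
from fractions import Fraction


def rao_blackwell_linear(weights):
    """Regression slope and conditional variance of T = sum(l_i X_i) given the
    sample mean, for X_1..X_n i.i.d. N(theta, 1) (hypothetical helper)."""
    n = len(weights)
    if sum(weights) != 1:
        raise ValueError("weights must sum to one for T to be unbiased for theta")
    var_t = sum(w * w for w in weights)        # Var(T) = sum l_i^2
    var_mean = Fraction(1, n)                  # Var(X-bar) = 1/n
    cov = sum(weights) * Fraction(1, n)        # Cov(T, X-bar) = (sum l_i)/n = 1/n
    slope = cov / var_mean                     # coefficient of (x-bar - theta)
    cond_var = var_t - cov * cov / var_mean    # sum l_i^2 - 1/n
    return slope, cond_var


# Illustrative weights only; any weights summing to one give slope 1.
weights = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
slope, cond_var = rao_blackwell_linear(weights)
print(slope, cond_var)
```

For equal weights $l_i = 1/n$ the conditional variance is zero, which corresponds to the equality case of (3.4.1).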

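Next consider the situation where we want to estimate $\psi_2(\theta) = \theta^2$. Then again observing that $E(\bar X^2) = \theta^2 + \frac{1}{n}$ we have $\varphi_2(\bar X) = \bar X^2 - \frac{1}{n} \in U_{\theta^2}$, which is a function of the minimal sufficient statistic $\bar X$. Let now $T_1 = X_1^2 - 1$; then as $E(X_1^2) = \theta^2 + 1$ we have $T_1 \in U_{\theta^2}$ and $E(X_1^2 - 1 \mid \bar X = \bar x) = \bar x^2 - \frac{1}{n}$, as the conditional distribution of $X_1$ given $\bar X = \bar x$ is $N\left(\bar x,\ 1 - \frac{1}{n}\right)$ and therefore $E(X_1^2 \mid \bar X = \bar x) = \bar x^2 + 1 - \frac{1}{n}$ for each $\bar x$ fixed. Indeed Rao-Blackwellization of $X_i^2 - 1$, $i = 1, 2, \ldots, n$ will lead to $\varphi_i(\bar X) = \bar X^2 - \frac{1}{n}$. If we take $T = \frac{1}{n}\sum X_i^2 - 1$, which belongs to $U_{\theta^2}$, its Rao-Blackwellization would again lead to $\bar X^2 - 1/n$. The question whether $\bar X^2 - 1/n$ is MVUE of $\theta^2$ cannot be decided using the CRLB result. One can show that there does not exist an unbiased estimator of $\theta^2$ whose variance attains the CRLB. This follows from the fact that $\frac{\partial \log L}{\partial \theta} = n(\bar x - \theta)$ and $T = \theta^2 \pm \frac{2\theta}{n}\, n(\bar x - \theta)$, and for any choice of sign, positive or negative, $T$ does not become purely a function of $\bar x$. Moreover the CRLB for $\psi(\theta) = \theta^2$ is $4\theta^2/n$ whereas

$$\mathrm{Var}\left(\bar X^2 - \frac{1}{n}\right) = \mathrm{Var}(\bar X^2) = E(\bar X^4) - [E(\bar X^2)]^2 = \left(\frac{3}{n^2} + \frac{6\theta^2}{n} + \theta^4\right) - \left(\theta^2 + \frac{1}{n}\right)^2 = \frac{4\theta^2}{n} + \frac{2}{n^2} > \frac{4\theta^2}{n}.$$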
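The variance computation above can be verified with exact rational arithmetic. This is a small sketch under the stated model only ($\bar X \sim N(\theta, 1/n)$); the function name `variance_gap` and the chosen values of $\theta$ and $n$ are assumptions for illustration. It uses the normal raw moments $E(\bar X^2) = \theta^2 + \sigma^2$ and $E(\bar X^4) = \theta^4 + 6\theta^2\sigma^2 + 3\sigma^4$ with $\sigma^2 = 1/n$.

```python
from fractions import Fraction


def variance_gap(theta, n):
    """Return (Var(X-bar^2 - 1/n), CRLB for theta^2) when X-bar ~ N(theta, 1/n)."""
    s2 = Fraction(1, n)                                # Var(X-bar)
    m2 = theta ** 2 + s2                               # E(X-bar^2)
    m4 = theta ** 4 + 6 * theta ** 2 * s2 + 3 * s2 ** 2  # E(X-bar^4), normal raw moment
    var = m4 - m2 ** 2                                 # subtracting the constant 1/n leaves the variance unchanged
    crlb = 4 * theta ** 2 * s2                         # (d theta^2 / d theta)^2 / (n * 1)
    return var, crlb


var, crlb = variance_gap(Fraction(3, 2), 5)
print(var, crlb, var - crlb)   # the gap equals 2/n^2 whatever theta is
```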


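Example 3.4.2 Let $(X_1, \ldots, X_n)$ be i.i.d. Poisson $(\lambda)$ where $M = \sum X_i \sim$ Poisson $(n\lambda)$ is the minimal sufficient statistic for $\lambda$. Suppose we want to estimate $\psi(\lambda) = e^{-\lambda} = P[X = 0]$. Define $T_1(X_1) = 1$ if $X_1 = 0$ and $T_1(X_1) = 0$ if $X_1 \ne 0$. Then $E(T_1) = P[X_1 = 0] = e^{-\lambda}$, and $T_1 \in U_\psi$. Now consider

$$\varphi_1(m) = E(T_1 \mid M = m) = P[X_1 = 0 \mid M = m] = \frac{P[X_1 = 0 \text{ and } \sum X_i = m]}{P[\sum X_i = m]} = \frac{P\left[X_1 = 0 \text{ and } \sum_{i=2}^{n} X_i = m\right]}{P\left[\sum X_i = m\right]}.$$

But $X_1$ and $\sum_{i=2}^{n} X_i \sim$ Poisson $[(n-1)\lambda]$ are independent. Therefore

$$\varphi_1(m) = \frac{e^{-\lambda}\cdot e^{-(n-1)\lambda}\,[(n-1)\lambda]^m / m!}{e^{-n\lambda}\,(n\lambda)^m / m!} = \left(\frac{n-1}{n}\right)^m.$$

Thus

$$\varphi_1(M) = \left(1 - \frac{1}{n}\right)^M, \quad M = 0, 1, 2, \ldots$$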
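As a numerical illustration of Example 3.4.2, the sketch below evaluates the ratio of Poisson probabilities that defines $\varphi_1(m)$ at two different values of $\lambda$, and then sums the series $E[\varphi_1(M)]$. It is only a check under the stated Poisson model; the helper names `poisson_pmf`, `phi_from_ratio` and `expected_phi`, the truncation point, and the numerical values are all assumptions made for the example.

```python
from math import exp, lgamma, log


def poisson_pmf(k, mu):
    """P[Y = k] for Y ~ Poisson(mu), mu > 0, computed on the log scale."""
    return exp(-mu + k * log(mu) - lgamma(k + 1))


def phi_from_ratio(m, n, lam):
    """P[X_1 = 0 | M = m] as the ratio used in the text (requires n >= 2)."""
    return poisson_pmf(0, lam) * poisson_pmf(m, (n - 1) * lam) / poisson_pmf(m, n * lam)


def expected_phi(n, lam, terms=200):
    """Truncated series for E[((n-1)/n)^M] with M ~ Poisson(n * lam)."""
    return sum(((n - 1) / n) ** m * poisson_pmf(m, n * lam) for m in range(terms))


# The conditional probability is free of lambda, as sufficiency of M requires.
print(phi_from_ratio(3, 4, 0.7), phi_from_ratio(3, 4, 2.5), (3 / 4) ** 3)
# The Rao-Blackwellized estimator is unbiased for exp(-lambda).
print(expected_phi(4, 0.7), exp(-0.7))
```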


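On the other hand, following the methods of Section 3.1, we have that if $\varphi(M) \in U_\psi$ then

$$\sum_{m=0}^{\infty} \varphi(m)\, e^{-n\lambda}\, \frac{(n\lambda)^m}{m!} = e^{-\lambda}, \quad \lambda > 0$$

or

$$\sum_{m=0}^{\infty} \frac{\varphi(m)\, n^m \lambda^m}{m!} = e^{(n-1)\lambda} = \sum_{m=0}^{\infty} \frac{(n-1)^m \lambda^m}{m!}, \quad \forall\ \lambda > 0.$$

Comparing coefficients of $\lambda^m$ on both sides we have

$$\frac{\varphi(m)\, n^m}{m!} = \frac{(n-1)^m}{m!}, \quad m = 0, 1, 2, \ldots$$

or

$$\varphi(m) = \frac{(n-1)^m}{n^m}, \quad m = 0, 1, 2, \ldots$$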

What we have shown here is that within $U_\psi$ there exists only one unbiased estimator of $\psi$ which depends on $(X_1, \ldots, X_n)$ through the minimal sufficient statistic $M = \sum X_i$. This implies that in this case, for any $T \in U_\psi$ the Rao-Blackwellization must lead to one and the same estimator $\varphi(M)$ defined above, since for every $T \in U_\psi$, $E(T \mid M = m) = \varphi(m)$ for each $m$. Therefore $\varphi(M)$ obtained above must be MVUE. If such a situation were to occur in Ex. 3.4.1 we could claim that $\bar X$ is MVUE of $\theta$ and so is $\bar X^2 - \frac{1}{n}$ of $\theta^2$. We assure the reader that this indeed is true and is a consequence of the property of completeness of the minimal sufficient statistic, which holds in special models such as the one-parameter exponential family. Observe that $\{N(\theta, 1),\ \theta \in R_1\}$ and $\{$Poisson $(\lambda),\ \lambda > 0\}$ are both one-parameter exponential families. In the next section we study the completeness property and show how the problem of MVUE of $\psi(\theta)$ can be resolved in case the model $\{L(x,\theta),\ \theta \in \Omega\}$ admits a minimal sufficient statistic which has the completeness property.

Exercise 3.4.1 (1) Let $(X_1, \ldots, X_n)$, $n \ge 2$ be i.i.d. Bernoulli with $P(X_i = 1) = \theta$. Let

$T_1(X_1, X_2) = 1$ if $X_1 = 1$ and $X_2 = 0$
$\phantom{T_1(X_1, X_2)} = 0$ otherwise.

Show that $T_1(X_1, X_2) \in U_{\theta(1-\theta)}$ and obtain the Rao-Blackwellized version of $T_1$, say $\varphi_1(t)$, w.r.t. the minimal sufficient statistic $T = \sum X_i \sim \mathrm{Bin}(n, \theta)$. By solving the equation

$$\sum_{t=0}^{n} \varphi(t) \binom{n}{t} \theta^t (1-\theta)^{n-t} = \theta(1-\theta), \quad 0 < \theta < 1,$$

show that the Rao-Blackwellized version of $T_1$, $\varphi_1(t)$, is the unique unbiased estimator of $\theta(1-\theta)$, i.e. $\varphi(t) = \varphi_1(t)$.

(2) In Example 3.4.2 let $\psi(\lambda) = e^{-\lambda}\lambda^r / r!$. Define $T_1(X_1) = 1$ if $X_1 = r$ and zero otherwise. Obtain $\varphi_1(m)$, the Rao-Blackwellized version of $T_1$, and show that it is the unique solution of

$$\sum_{m=0}^{\infty} \varphi(m)\, e^{-n\lambda}\, \frac{(n\lambda)^m}{m!} = e^{-\lambda}\, \frac{\lambda^r}{r!}, \quad \forall\ \lambda > 0.$$

(3) In Example 3.4.2 define $T_1(X_1, X_2) = 1$ if $X_1 = 0$, $X_2 = 0$ and zero otherwise. Show that $T_1 \in U_{e^{-2\lambda}}$ and Rao-Blackwellize $T_1$. Show that $\varphi_1(m)$ is the unique solution of $E[\varphi(M)] = e^{-2\lambda}$, $\forall\ \lambda > 0$.

(4) In Example 3.4.1 define $T_1(X_1) = 1$ if $a < X_1 < b$ and zero otherwise. Then $E(T_1) = \Phi(b - \theta) - \Phi(a - \theta)$. Rao-Blackwellize $T_1$ and obtain $\varphi_1(\bar X)$. In a quality control problem, if the product is declared satisfactory when $a < X_1 < b$, where $a$ and $b$ are specification limits of $X$, say the diameter of the head of a screw, then $\Phi(b - \theta) - \Phi(a - \theta)$ represents the fraction of satisfactory items out of the total production produced by the normal process with mean $\theta$ and variance one.

(5) Let $(X_1, \ldots, X_n)$ be i.i.d. $U(0, \theta)$ and let $T = X_{(n)}$ be the minimal sufficient statistic. Let $X_{(1)}$ be the first component of the order statistic. Obtain $T_1(X_{(1)})$ such that $E(T_1(X_{(1)})) = \theta$ and Rao-Blackwellize it to obtain $\varphi_1(X_{(n)}) = \frac{n+1}{n} X_{(n)}$.

(6) Let $(X_1, \ldots, X_n)$ be i.i.d. exponential with mean $\theta$, a model often used to describe the failure time of items of various types, e.g. a T.V. tube or an electric bulb. Let $\psi(\theta) = e^{-x_0/\theta} = P[X > x_0] = R[x_0]$, the reliability of the item at $x_0$, i.e. the probability of failure-free operation of the item until $x_0$. Define $T_1(X_1) = 1$ if $X_1 > x_0$ and zero otherwise; then $T_1 \in U_\psi$. Rao-Blackwellize $T_1$ w.r.t. the minimal sufficient statistic $T = \sum X_i$.


3.5 Completeness and Lehmann-Scheffe Theorem 

Let $M$ be the minimal sufficient statistic for $\{L(x,\theta),\ \theta \in \Omega \subset R_1\}$ and let $\psi(\theta)$ be an estimable function, i.e. $U_\psi$ is not empty. Now if $T_1 \in U_\psi$ then by Rao-Blackwellization we obtain $E(T_1 \mid M) = \varphi_1(M) \in U_\psi$. As seen in special cases (Ex. 3.4.1 or 3.4.2), $\varphi_1(M)$ is the unique element in $U_\psi$ which depends on $X$ through the minimal sufficient statistic $M$. Let $\{g_0(m, \theta),\ \theta \in \Omega\}$ be the corresponding class of pdfs of $M$ under $\theta$; then we wish to characterize the situation where the integral equation $\int \varphi(m)\, g_0(m, \theta)\, dm = \psi(\theta)$, $\forall\ \theta \in \Omega$, for given $g_0(m, \theta)$ and $\psi(\theta)$ has a unique solution $\varphi_1(m)$. Now observe that if $\varphi_1(M) \in U_\psi$ and $\varphi_2(M) \in U_\psi$ then $E[\varphi_1(M) - \varphi_2(M)] = 0$, $\forall\ \theta \in \Omega$. If this implies that $\varphi_1(M) = \varphi_2(M)$ with probability one for each $\theta \in \Omega$, then $\varphi_1(M)$ is essentially unique. Consider estimating the function $\psi(\theta) = 0$, $\forall\ \theta \in \Omega$ and let $U_0$ denote the class of all unbiased estimators of zero which are functions of $M$. Then we require that the integral equation

$$\int \chi(m)\, g_0(m, \theta)\, dm = 0, \quad \forall\ \theta \in \Omega$$

holds iff $\chi(M) = 0$ w.p. 1 under each $\theta \in \Omega$, that is, the only unbiased estimator of zero in $U_0$ is the estimator identically equal to zero. The reader will notice the obvious analogy with the solution of a system of linear equations where we have to solve $Ax = b$ and we initially consider the homogeneous linear equations $Ax = 0$. When the matrix $A$ is of full rank then the equation $Ax = 0$ has the unique solution $x = 0$ and consequently we have also the unique solution for the equation $Ax = b$ given by $x = A^{-1}b$. Here of course we assume that $A$ is a square matrix $m \times m$ and $x$ and $b$ are $m \times 1$ vectors. We therefore define the completeness of a minimal sufficient statistic $M$ as follows.

Definition 3.5.1 A minimal sufficient statistic $M$ for the class of pdfs $\{L(x,\theta),\ \theta \in \Omega\}$ is said to be complete if $E_\theta[\varphi(M)] = 0$, $\forall\ \theta \in \Omega$ implies that $\varphi(M) = 0$ with probability one under each $\theta \in \Omega$.

Example 3.5.1 Let $(X_1, \ldots, X_n)$ be i.i.d. $B(1, \theta)$; then as seen earlier $M = \sum X_i \sim B(n, \theta)$ is minimal sufficient for $\theta$. We now show that $M$ is complete. Let $\varphi(M)$ be an unbiased estimator of zero; then

$$\sum_{m=0}^{n} \varphi(m) \binom{n}{m} \theta^m (1-\theta)^{n-m} = 0, \quad 0 < \theta < 1.$$

Now the L.H.S. of the above equation is a polynomial of degree $n$ which vanishes identically for $\theta \in (0, 1)$. Therefore the coefficient of every power of $\theta$ must be zero. Consider the constant term (coefficient of $\theta^0$). This is included in $\varphi(0)\binom{n}{0}\theta^0(1-\theta)^n$ only and is given by $\varphi(0)$, as $(1-\theta)^n = 1 - n\theta + \binom{n}{2}\theta^2 - \ldots + (-\theta)^n$. Therefore $\varphi(0) = 0$ and $\sum_{m=1}^{n} \varphi(m)\binom{n}{m}\theta^m(1-\theta)^{n-m} = 0$. Cancelling one power of $\theta$ and again looking at the constant term on the LHS we have $\varphi(1) = 0$. Repeating the process we arrive at the result $\varphi(m) = 0$ for $m = 0, 1, 2, \ldots, n$, or $\varphi(M) = 0$ w.p. 1 under each $\theta \in (0, 1)$.
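The coefficient-by-coefficient argument of Example 3.5.1 amounts to saying that a certain integer matrix is lower triangular with nonzero diagonal, so that forward substitution forces $\varphi(0) = \varphi(1) = \ldots = \varphi(n) = 0$. The sketch below builds that matrix explicitly; the function names `coefficient_matrix` and `only_zero_solution` are illustrative assumptions, not notation from the text.

```python
from math import comb


def coefficient_matrix(n):
    """a[k][m] = coefficient of theta^k in C(n, m) * theta^m * (1 - theta)^(n - m)."""
    a = [[0] * (n + 1) for _ in range(n + 1)]
    for m in range(n + 1):
        for k in range(m, n + 1):
            # expand (1 - theta)^(n - m) by the binomial theorem
            a[k][m] = comb(n, m) * comb(n - m, k - m) * (-1) ** (k - m)
    return a


def only_zero_solution(n):
    """True when the matrix is lower triangular with nonzero diagonal,
    so that a * phi = 0 forces phi = 0."""
    a = coefficient_matrix(n)
    upper_is_zero = all(a[k][m] == 0 for k in range(n + 1) for m in range(k + 1, n + 1))
    diagonal_nonzero = all(a[m][m] != 0 for m in range(n + 1))
    return upper_is_zero and diagonal_nonzero


print(coefficient_matrix(2))
print(only_zero_solution(6))
```

Row $k$ of the matrix, applied to $(\varphi(0), \ldots, \varphi(n))$, is the coefficient of $\theta^k$ in $E_\theta[\varphi(M)]$; the diagonal entries are $\binom{n}{m}$.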


Example 3.5.2 Let $(X_1, \ldots, X_n)$ be i.i.d. $P(\lambda)$; then $M = \sum X_i \sim P(n\lambda)$ and if we consider the identity $E[\varphi(M)] = 0$, $\forall\ \lambda > 0$ we have

$$\sum_{m=0}^{\infty} \varphi(m)\, e^{-n\lambda}\, \frac{(n\lambda)^m}{m!} = 0, \quad \forall\ \lambda > 0$$

or

$$\sum_{m=0}^{\infty} \frac{\varphi(m)\, n^m}{m!}\, \lambda^m = 0, \quad \forall\ \lambda > 0.$$

Now the LHS is a power series in $\lambda$ (an analytic function of $\lambda$) which vanishes over a non-degenerate interval $(0, \infty)$. Hence the coefficient of each power $\lambda^m$, $m = 0, 1, 2, \ldots$ should be equal to zero and therefore $\frac{\varphi(m)\, n^m}{m!} = 0$ for each $m = 0, 1, 2, \ldots$. As $\frac{n^m}{m!} \ne 0$ for any $m$ we have $\varphi(m) = 0$ for each $m$, and thus the minimal sufficient statistic $M = \sum X_i$ is complete in this case also.

Example 3.5.3 Let $(X_1, \ldots, X_n)$ be a random sample of size $n$ from a power series distribution with $f(x, \theta) = \frac{a(x)\,\theta^x}{b(\theta)}$, $x = 0, 1, 2, \ldots$ and $\Omega = (0, \rho)$. As seen earlier the distribution belongs to the one-parameter exponential family with minimal sufficient statistic $M = \sum X_i$ having a power series distribution given by

$$g_0(m, \theta) = \frac{c(m)\,\theta^m}{[b(\theta)]^n}, \quad m = 0, 1, 2, \ldots$$

Now

$$E[\varphi(M)] = \sum_{m=0}^{\infty} \varphi(m)\, \frac{c(m)\,\theta^m}{[b(\theta)]^n} = 0$$

or

$$\sum_{m=0}^{\infty} \varphi(m)\, c(m)\, \theta^m = 0, \quad \forall\ \theta \in (0, \rho)$$

which implies that

$$\varphi(m)\, c(m) = 0, \quad m = 0, 1, 2, \ldots$$

i.e. $\varphi(m) = 0$ when $c(m) > 0$, or $\varphi(M)$ is zero w.p. 1. Note that $P(M = m) = 0$, $\forall\ \theta \in (0, \rho)$ if $c(m) = 0$.

It is not very easy to prove the completeness property of the minimal sufficient statistic $M$ when it is a continuous r.v. with pdf $\{g_0(m, \theta),\ \theta \in \Omega\}$. We will indicate below a heuristic proof of completeness of $\bar X$, the sample mean, in the case of a random sample of size $n$ from $N(\theta, 1)$, which can be generalized to the case of a random sample of size $n$ from $\{f(x, \theta),\ \theta \in \Omega\}$ which forms a one-parameter exponential family.

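Example 3.5.4 Let $(X_1, \ldots, X_n)$ be a random sample from $\{N(\theta, 1),\ \theta \in R_1\}$; then $\bar X$ is minimal sufficient with

$$g_0(x, \theta) = \sqrt{\frac{n}{2\pi}}\, \exp\left\{-\frac{n(x - \theta)^2}{2}\right\}, \quad x \in R_1,\ \theta \in R_1.$$

Let $\varphi(x)$ be an unbiased estimator of zero so that $E[\varphi(\bar X)] = 0$, $\forall\ \theta \in R_1$, which implies that

$$\int_{R_1} \varphi(x)\, \exp\left\{-\frac{n(x - \theta)^2}{2}\right\} dx = 0, \quad \forall\ \theta \in R_1$$

or, cancelling the factor $\exp(-n\theta^2/2)$ and writing $\theta$ for $n\theta$ (which also ranges over $R_1$),

$$\int_{R_1} \varphi(x)\, \exp\left\{-\frac{n x^2}{2}\right\} e^{\theta x}\, dx = 0, \quad \forall\ \theta \in R_1. \qquad (3.5.1)$$

Define $\varphi_+(x) = \mathrm{Max}(\varphi(x), 0)$ and $\varphi_-(x) = \mathrm{Max}(0, -\varphi(x))$; then $\varphi(x) = \varphi_+(x) - \varphi_-(x)$ and $|\varphi(x)| = \varphi_+(x) + \varphi_-(x)$. Recall that $E[\varphi(x)]$ exists iff $E[|\varphi(x)|]$ exists. Thus

$$\int \varphi_+(x)\, \exp\left\{-\frac{n x^2}{2}\right\} e^{\theta x}\, dx = \int \varphi_-(x)\, \exp\left\{-\frac{n x^2}{2}\right\} e^{\theta x}\, dx.$$

In particular at $\theta = 0$ we have

$$\int \varphi_+(x)\, \exp\left\{-\frac{n x^2}{2}\right\} dx = \int \varphi_-(x)\, \exp\left\{-\frac{n x^2}{2}\right\} dx = C_0.$$

Observe that $h_+(x) = \varphi_+(x)\, \exp\left\{-\frac{n x^2}{2}\right\} \big/ C_0 \ge 0$ and $\int h_+(x)\, dx = 1$, so $h_+(x)$ is a pdf over $R_1$. Similarly $h_-(x)$ is also a pdf over $R_1$. Then we have, $\forall\ \theta \in R_1$,

$$Q_+(\theta) = \int h_+(x)\, e^{\theta x}\, dx = Q_-(\theta) = \int h_-(x)\, e^{\theta x}\, dx. \qquad (3.5.2)$$

However $Q_+(\theta)$ is the mgf of the pdf given by $h_+(u)$ and $Q_-(\theta)$ is the mgf of the pdf given by $h_-(u)$, and these two mgfs agree over an interval of $R_1$ containing the origin. We now have a theorem which asserts that if two mgfs agree over an interval containing the origin then their pdfs must be the same, or $h_+(x) = h_-(x)$, $\forall\ x \in R_1$. This however implies that $\varphi_+(x) = \varphi_-(x)$ for $x \in R_1$, from which it follows that $\varphi(x) = 0$ $\forall\ x \in R_1$, or $\varphi(\bar X) = 0$ w.p. 1 under each $\theta \in \Omega$.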

Following similar arguments one could show that for the one-parameter exponential family the minimal sufficient statistic $M = \sum_{i=1}^{n} k(x_i)$ is complete. Observe that

$$g_0(m, \theta) = \exp\{u(\theta)\, m + n\, v(\theta) + w_1(m)\}.$$

Using now the fact that $\frac{du}{d\theta} \ne 0$ we make the 1 : 1 transformation of the parameter space $\theta \to \eta = u(\theta)$; then we have

$$g_0'(m, \eta) = \exp\{\eta m + v_1(\eta) + w_1(m)\}$$

and we have for all $\eta \in \Omega_1$

$$\int \varphi(m)\, g_0'(m, \eta)\, dm = \int \varphi(m)\, \exp\{\eta m + v_1(\eta) + w_1(m)\}\, dm = 0.$$

Thus

$$\int h_+(m)\, e^{\eta m}\, dm = \int h_-(m)\, e^{\eta m}\, dm, \quad \forall\ \eta \in \Omega_1 \qquad (3.5.3)$$

where

$$h_+(m) = \frac{\varphi_+(m)\, \exp\{w_1(m)\}}{c_0} \quad\text{and}\quad h_-(m) = \frac{\varphi_-(m)\, \exp\{w_1(m)\}}{c_0}$$

and $c_0$ is the norming constant which makes $h_+(m)$ and $h_-(m)$ pdfs. Equation (3.5.3) implies that the mgfs of $h_+(m)$ and $h_-(m)$ agree over an interval containing zero and thus $h_+(m) = h_-(m)$, $\forall\ m$, which implies that $\varphi_+(m) = \varphi_-(m)$, $\forall\ m$ and hence $\varphi(m) = 0$ w.p. 1.

In the heuristic proof given above we have slurred over many fine points. Indeed proving completeness of $M = \sum k(X_i)$ in the one-parameter exponential family is essentially the same as proving the uniqueness property of the Laplace transform of an integrable function, which plays a crucial role in applications of Mathematics to Engineering and Physics among others. It also indicates that completeness is a deep property, yielding the following powerful theorem due to Lehmann and Scheffé.


Theorem 3.5.2 Let $\{L(x,\theta),\ \theta \in \Omega\}$ admit a minimal sufficient statistic $M$ which is complete. Suppose $U_\psi$ is not empty; then the MVUE of $\psi(\theta)$ is the unique element $\varphi(M) \in U_\psi$ which is a function of $M$.

To prove the theorem we observe that by the Rao-Blackwell theorem the search for MVUE can be restricted to those elements of $U_\psi$, the unbiased estimators of $\psi(\theta)$, which are functions of $M$. In view of the completeness of $M$, this subclass consists of a single element $\varphi(M)$ only and it follows that $\varphi(M)$ is MVUE.

Example 3.5.5 As $\bar X \sim N(\theta, 1/n)$, $\theta \in R_1$ is an exponential family, it follows that $\bar X$ is minimal sufficient and complete. Therefore, by the Rao-Blackwell and Lehmann-Scheffé theorems (RBLS), using Example 3.4.1 it follows that $\bar X$ is MVUE for $\theta$ and $\bar X^2 - \frac{1}{n}$ is MVUE for $\theta^2$. Similarly, as $\sum X_i \sim P(n\lambda)$ is the minimal complete sufficient statistic for a random sample of size $n$ from $P(\lambda)$, from Example 3.4.2 it follows that $\frac{M}{n} = \bar X$ is MVUE for $\lambda$ and $\varphi(M) = \left(\frac{n-1}{n}\right)^M$ is MVUE of $e^{-\lambda}$. The minimal sufficient statistic $\sum X_i = M$ in sampling from $\{b(1, \theta),\ 0 < \theta < 1\}$, $\{P(\lambda),\ \lambda > 0\}$, $\{$exponential distribution with mean $\theta\}$ can be shown to be complete as the pdfs belong to the exponential family. Therefore the unbiased estimators derived by Rao-Blackwellization in Exercise 3.4.1, except (5), will yield the MVUE of the corresponding function $\psi$.


Example 3.5.6 Let $(X_1, \ldots, X_n)$ be a random sample of size $n$ from the Pareto distribution $f(x, \theta) = \frac{\theta}{x^{\theta+1}}$, $x > 1$, $\theta > 0$. Then as the Pareto distribution belongs to the one-parameter exponential family, $M = \sum \log X_i$ is the minimal complete sufficient statistic with pdf

$$g_0(m, \theta) = \frac{\theta^n e^{-\theta m} m^{n-1}}{\Gamma(n)}, \quad m > 0,\ \theta > 0.$$

Suppose $\psi(\theta) = \theta$. Then we observe that $E[M^r] = \frac{\Gamma(n+r)}{\Gamma(n)\,\theta^r}$ for any $r > 0$, as well as for any $r$ such that $n + r > 0$, i.e. for $r > -n$. Therefore so long as $n > 2$, $E(M^{-1}) = \frac{\theta}{n-1}$, or $\frac{n-1}{M} \in U_\theta$ and $\varphi(M) = \frac{n-1}{\sum \log X_i}$ is MVUE of $\theta$. Indeed $\varphi_r(M) = M^r$ would be MVUE of $\frac{\Gamma(n+r)}{\Gamma(n)\,\theta^r}$ for any $r > -n$, or $\frac{\Gamma(n)}{\Gamma(n+r)} M^r$ would be MVUE of $\frac{1}{\theta^r}$.


Example 3.5.7 Let $(X_1, \ldots, X_n)$ be a random sample of size $n$ from a geometric distribution with pmf $f(x, \theta) = \theta(1-\theta)^x$, $x = 0, 1, 2, \ldots$, $0 < \theta < 1$. As $\{f(x, \theta),\ \theta \in \Omega\}$ is a one-parameter exponential family (in fact a power series distribution) we have that $M = \sum X_i$ is the minimal sufficient statistic with pmf

$$g_0(m, \theta) = \binom{m+n-1}{n-1} \theta^n (1-\theta)^m, \quad m = 0, 1, 2, \ldots,\ 0 < \theta < 1$$

which itself belongs to the one-parameter exponential family. Suppose we want to estimate $\theta$; then by the RBLS theorems we must solve the equation

$$\sum_{m=0}^{\infty} \varphi(m) \binom{m+n-1}{n-1} \theta^n (1-\theta)^m = \theta$$

or, writing $q = 1 - \theta$,

$$\sum_{m=0}^{\infty} \varphi(m) \binom{m+n-1}{n-1} q^m = (1-q)^{-(n-1)} = \sum_{m=0}^{\infty} \binom{-n+1}{m} (-1)^m q^m.$$

Equating coefficients of $q^m$ for each $m$ we have

$$\varphi(m) = \binom{-n+1}{m}(-1)^m \Big/ \binom{m+n-1}{n-1} = \binom{m+n-2}{m} \Big/ \binom{m+n-1}{n-1} = \frac{n-1}{m+n-1}, \quad m = 0, 1, 2, \ldots$$

Therefore we have that $\varphi(M) = \frac{n-1}{M+n-1}$, $M = 0, 1, 2, \ldots$ is MVUE of $\theta$.
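The unbiasedness of $\varphi(M) = (n-1)/(M+n-1)$ under the negative binomial pmf $g_0(m, \theta)$ can be checked numerically by summing the series. This is only a sketch: the function names, the truncation at 2000 terms and the parameter values are assumptions chosen for illustration, and it requires $n \ge 2$.

```python
from math import comb


def phi(m, n):
    """Candidate MVUE of theta from Example 3.5.7 (n >= 2)."""
    return (n - 1) / (m + n - 1)


def neg_binomial_pmf(m, n, theta):
    """P[M = m] for M = X_1 + ... + X_n with X_i i.i.d. geometric on {0, 1, 2, ...}."""
    return comb(m + n - 1, n - 1) * theta ** n * (1 - theta) ** m


def expected_phi(n, theta, terms=2000):
    """Truncated series for E[phi(M)]; the neglected tail is negligible for these inputs."""
    return sum(phi(m, n) * neg_binomial_pmf(m, n, theta) for m in range(terms))


for n, theta in [(3, 0.3), (5, 0.6)]:
    print(n, theta, expected_phi(n, theta))
```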

We note that sometimes in SQC the inverse binomial sampling method is used, i.e. rather than taking a fixed lot for inspection and then determining the number of defectives in the lot, we sample until $n$ satisfactory/defective items are found. The sample size in excess of $n$, say $M$, has a negative binomial distribution given by $g_0(m, \theta)$ where $\theta$ is the probability of an item being satisfactory/defective.

We recommend that readers Rao-Blackwellize $T_1(X_1) = 1$ if $X_1 = 0$ and zero otherwise to obtain the above MVUE. The basic property we have to use is the fact that if $X_1, \ldots, X_k$ are i.i.d. geometric then $(X_1 + X_2 + \ldots + X_k)$ is negative binomial with pmf

$$g(k, r, \theta) = \binom{r+k-1}{k-1} \theta^k (1-\theta)^r, \quad r = 0, 1, 2, \ldots$$

We also point out that the RB and LS theorems are sometimes paraphrased as a single theorem, stated as follows.

Theorem 3.5.3 (RBLS Theorem): If $M$ is a minimal complete sufficient statistic for $\{L(x,\theta),\ \theta \in \Omega\}$ then any statistic $\varphi(M)$ is MVUE of its expectation $E_\theta[\varphi(M)]$.

As in the case of the one-parameter exponential family we can show that in the case of a one-parameter Pitman family where a one-dimensional minimal sufficient statistic $T$ exists, $T$ is complete. We again give a heuristic proof in the case of $X_{(n)} = \mathrm{Max}(X_1, \ldots, X_n)$ where $(X_1, \ldots, X_n)$ are i.i.d. $U(0, \theta)$, $\theta > 0$. Let $\varphi(X_{(n)}) \in U_0$; then

$$\int_0^\theta \varphi(x_{(n)})\, x_{(n)}^{n-1}\, dx_{(n)} = 0, \quad \forall\ \theta \in (0, \infty)$$

or equivalently for each $\theta > 0$

$$\int_0^\theta \varphi_+(x_{(n)})\, x_{(n)}^{n-1}\, dx_{(n)} = \int_0^\theta \varphi_-(x_{(n)})\, x_{(n)}^{n-1}\, dx_{(n)}$$

or $H_+(0, \theta) = H_-(0, \theta)$, where $H_+(0, \theta)$ and $H_-(0, \theta)$ are the integrals over $(0, \theta)$ of $\varphi_+(x_{(n)})\, x_{(n)}^{n-1}$ and $\varphi_-(x_{(n)})\, x_{(n)}^{n-1}$. If $\varphi_+$ and $\varphi_-$ are continuous, by the fundamental theorem of integral calculus $H_+(0, \theta) = H_-(0, \theta)$ implies that $\varphi_+(\theta) = \varphi_-(\theta)$ for each $\theta > 0$, or $\varphi_+(x_{(n)}) = \varphi_-(x_{(n)})$, $x_{(n)} > 0$. Therefore $\varphi(x_{(n)}) = 0$ for each $x_{(n)} > 0$. If $\varphi_+$ and $\varphi_-$ are not necessarily continuous one can use measure theoretic arguments based on the fact that if two non-negative set functions agree over all intervals of $(0, \infty)$ then they must agree on each Lebesgue set of $(0, \infty)$, and in turn the derivatives of such set functions w.r.t. Lebesgue measure must agree on $(0, \infty)$, or $\varphi_+ = \varphi_-$. Therefore, we can claim that $\varphi(x_{(n)}) = 0$ on $(0, \infty)$.

Again here we observe that completeness of minimal sufficient statistic 
is a deep property and its proof requires basic results in measure theory and 
integration and theory of analytic functions. 

Example 3.5.8 Using the fact that $X_{(n)}$ is the minimal complete sufficient statistic for $\theta$ in $U(0, \theta)$, it follows from the RBLS theorem that $\frac{n+1}{n} X_{(n)}$ is MVUE of $\theta$. Suppose we want to obtain the MVUE of the variance of the $U(0, \theta)$ distribution, $\theta^2/12$. Then observing that $E(X_{(n)}^2) = \frac{n}{n+2}\theta^2$ it follows that $\frac{n+2}{12n} X_{(n)}^2$ is MVUE of $\frac{\theta^2}{12}$.

Similarly, in the case of sampling from the exponential distribution with location $\theta$, i.e. $f(x, \theta) = \exp\{-(x - \theta)\}$, $x > \theta$, we have that $X_{(1)}$ is the minimal sufficient statistic and $X_{(1)} - \frac{1}{n}$ would be MVUE of $\theta$.

Consider a random sample of size $n$ from $U(-\theta, \theta)$ where $\theta > 0$. Then as seen in Example 2.6.1 we have that $T = \mathrm{Max}(-X_{(1)}, X_{(n)})$ is minimal sufficient with pdf $g_0(t, \theta) = \frac{n t^{n-1}}{\theta^n}$, $0 < t < \theta$. $T$ is thus complete. Then as $E(T) = \frac{n}{n+1}\theta$ we have that the MVUE of the upper limit $\theta$ is $\frac{n+1}{n} T$ and that of the lower limit $-\theta$ is $-\frac{n+1}{n} T$.

Example 3.5.9 As an example of a minimal sufficient statistic which is not complete consider a sample of size 2 from the Cauchy distribution with pdf $f(x, \theta) = \frac{1}{\pi[1 + (x - \theta)^2]}$, $x \in R_1$, $\theta \in R_1$. As seen in Example 2.4.3, the order statistic $(X_{(1)}, X_{(2)})$ is minimal sufficient for $\theta$. We now show that this two-dimensional minimal sufficient statistic for the one-dimensional parameter $\theta$ is not complete. Consider $T = X_{(2)} - X_{(1)}$; then $T = (X_{(2)} - \theta) - (X_{(1)} - \theta) = Y_{(2)} - Y_{(1)}$ where $(Y_{(1)}, Y_{(2)})$ is the order statistic of a sample of size two from the Cauchy distribution with $\theta = 0$. As such the distribution of $T$ does not depend on $\theta$, and $P_\theta[a < T < b] = p$ does not depend on $\theta$ for any given interval $(a, b)$. Now define $\varphi(X_{(1)}, X_{(2)}) = 1 - p$ if $a < T < b$ and $\varphi(X_{(1)}, X_{(2)}) = -p$ if either $T \le a$ or $T \ge b$; then $E[\varphi(X_{(1)}, X_{(2)})] = 0$ for each $\theta \in R_1$. This shows that the minimal sufficient statistic $(X_{(1)}, X_{(2)})$ is not complete. We remark here that for a sample of size $n$ from the Cauchy distribution with location $\theta$, we can show that the order statistic $(X_{(1)}, \ldots, X_{(n)})$ is minimal sufficient. Consider $T = X_{(n)} - X_{(1)}$ or in general any linear function of the components of the order statistic $T(l) = \sum l_i X_{(i)}$ with $\sum l_i = 0$. We can write $T(l) = \sum l_i Y_{(i)}$ where $Y_{(i)} = X_{(i)} - \theta$, and therefore $(Y_{(1)}, \ldots, Y_{(n)})$ has the distribution of the order statistic of a sample of size $n$ from $\theta = 0$ and as such $T(l)$ has a distribution independent of $\theta$. We can now select a function $\varphi(X_{(1)}, \ldots, X_{(n)}) = \varphi(\sum l_i X_{(i)})$ with zero expectation to claim that the order statistic of a sample of size $n$ is minimal sufficient but not complete in the Cauchy case.

The problem of MVU estimation is satisfactorily solved when the model is such that it admits a minimal sufficient statistic which is complete. This concept was introduced by Lehmann and Scheffé (1950) and they also showed that for most of the commonly occurring situations a search for a sufficient statistic which is complete would be enough, i.e. minimality in fact follows. We will not go deeper into the issue but instead in the next section obtain a necessary and sufficient condition for $T$ to be MVUE for $\psi(\theta)$ without any reference to sufficiency and completeness. The condition can be used in some situations where a minimal complete sufficient statistic does not exist.

3.6 A Necessary and Sufficient Condition for MVUE 

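Let $(X_1, X_2, \ldots, X_n)$ have joint pdf given by $\{L(x,\theta),\ \theta \in \Omega\}$ and let $U_\psi = \{T(x) \mid E(T(x)) = \psi(\theta),\ \forall\ \theta \in \Omega\}$ and $U_0 = \{u_0(x) \mid E(u_0(x)) = 0,\ \forall\ \theta \in \Omega\}$. We assume that $\mathrm{Var}(T(x)) < \infty$ and $\mathrm{Var}(u_0(x)) < \infty$.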

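Theorem 3.6.1 A necessary and sufficient condition that $T^*$ in $U_\psi$ is MVUE of $\psi(\theta)$ is given by

$$\mathrm{Cov}(T^*, u_0) = 0, \quad \forall\ \theta \in \Omega,\ \forall\ u_0 \in U_0 \qquad (3.6.1)$$

Assume that (3.6.1) holds and consider any $T \in U_\psi$; then as $T^* \in U_\psi$ we have $E(T^* - T) = 0$, $\forall\ \theta \in \Omega$, so that $T^* - T \in U_0$, and therefore $\mathrm{Cov}(T^*, T^* - T) = 0$, $\forall\ \theta \in \Omega$. Hence it follows that $\mathrm{Cov}(T^*, T) = \mathrm{Var}(T^*)$. Now

$$\mathrm{Var}(T^* - T) = \mathrm{Var}(T^*) + \mathrm{Var}(T) - 2\,\mathrm{Cov}(T^*, T) = \mathrm{Var}(T) - \mathrm{Var}(T^*) \ge 0, \quad \forall\ \theta \in \Omega.$$

Thus for any $T \in U_\psi$, $\mathrm{Var}(T) \ge \mathrm{Var}(T^*)$ $\forall\ \theta \in \Omega$ and therefore $T^* \in U_\psi$ is MVUE of $\psi(\theta)$. This completes the proof of sufficiency of the condition (3.6.1).

To prove necessity we follow the method of reductio ad absurdum. Suppose that $T^*$ is MVUE but (3.6.1) does not hold. Then there exists a $u_0 \in U_0$ and a $\theta_0 \in \Omega$ such that $\mathrm{Cov}(T^*, u_0) \ne 0$ at $\theta_0$. Now for any real $\lambda$, $T^* + \lambda u_0 \in U_\psi$, and

$$\mathrm{Var}(T^* + \lambda u_0) = \mathrm{Var}(T^*) + \lambda^2\, \mathrm{Var}(u_0) + 2\lambda\, \mathrm{Cov}(T^*, u_0).$$

Now let $\lambda_{\theta_0} = -\mathrm{Cov}_{\theta_0}(T^*, u_0) / \mathrm{Var}_{\theta_0}(u_0)$; then at $\theta = \theta_0$

$$\mathrm{Var}(T^* + \lambda_{\theta_0} u_0) = \mathrm{Var}(T^*) - [\mathrm{Cov}(T^*, u_0)]^2 / \mathrm{Var}(u_0)$$

which shows that $\mathrm{Var}(T^* + \lambda_{\theta_0} u_0) < \mathrm{Var}(T^*)$ at $\theta = \theta_0$, contradicting the datum that $T^*$ is MVUE. This completes the proof of necessity of the condition (3.6.1).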

In general, determination of $U_0$, the class of unbiased estimators of zero, is most difficult. Note that if we are able to specify $U_0$ then $U_\psi = \{T_1 + \alpha u_0 \mid u_0 \in U_0,\ \alpha \in R_1\}$, or $U_\psi$ is completely specified if we know at least one unbiased estimator $T_1$ of $\psi(\theta)$ and $U_0$. This situation is similar to that where the set of solutions of the non-homogeneous linear equations $AY = b$ consists of $y_1 + \lambda y_0$, where $y_1$ is a particular solution and $y_0$ is a solution of the homogeneous linear equations $AY = 0$. We now consider an example to illustrate the application of the Theorem.

Example 3.6.1 Let $X$ be a discrete r.v. with pmf given by $P[X = -1] = \alpha$, $P[X = x] = (1-\alpha)^2 \alpha^x$, $x = 0, 1, 2, \ldots$, where $0 < \alpha < 1$. Consider a sample of size one only. Then $x \sim y$ iff $L(x, \alpha)/L(y, \alpha)$ is independent of $\alpha$. We claim that $x \sim y$ iff $x = y$. Let $x = -1$ and $x \ne y$. Then $L(-1, \alpha)/L(y, \alpha) = \alpha^{1-y}/(1-\alpha)^2$ for any $y = 0, 1, 2, \ldots$ and depends on $\alpha$. Further let $x \ne y$ be two non-negative integers; then $x \sim y$ iff $\alpha^{x-y}$ does not depend on $\alpha$, or $x = y$. Thus $X$ itself is the minimal sufficient statistic for $\alpha$.

We now determine $U_0$. Let $u(x) \in U_0$; then $u(-1)\,\alpha + \sum_{x=0}^{\infty} u(x)\,(1-\alpha)^2 \alpha^x = 0$ for every $\alpha \in (0, 1)$. We now use the technique of equating coefficients of powers $\alpha^y$, $y = 0, 1, 2, \ldots$. The constant term is $u(0)$ and hence $u(0) = 0$. The coefficient of $\alpha$ is $u(-1) + u(1) = 0$ or $u(-1) = -u(1)$. The coefficient of $\alpha^y$, $y \ge 2$ is given by

$$u(y) - 2u(y-1) + u(y-2) = 0 \quad\text{or}\quad \Delta^2 u(y-2) = 0.$$

This is a finite difference equation of order 2, and just as $\frac{d^2 y}{dx^2} = 0$ implies that $y$ is linear or $y = a + bx$, $\Delta^2 u(y-2) = 0$ implies that for $y \ge 2$, $u(y) = a + by$, where the constants $a$, $b$ are determined by the initial conditions $u(0) = 0$ and $u(-1) + u(1) = 0$, i.e. $a = 0$ and $b$ can be any real number. Thus $u(y) = by$, $y = -1, 0, 1, \ldots$, $b \in R_1$ is the general solution and $U_0 = \{cX \mid c \in R_1\}$.

Otherwise we can determine $u(y)$ recursively. Now $u(0) = 0$ and let $u(1) = -u(-1) = c$. Then as $\Delta^2 u = 0$ we have $u(2) = 2c$ and $u(3) = 2u(2) - u(1) = 3c$, and so on. Thus we are led to $U_0 = \{cX \mid c \in R_1\}$ and the minimal sufficient statistic $X$ is not complete.
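The recursive determination of $u(y)$ can be sketched in a few lines. The code below runs the recursion $u(y) = 2u(y-1) - u(y-2)$ from the initial values $u(0) = 0$, $u(1) = -u(-1) = c$, and then checks numerically that the resulting estimator has zero expectation under the pmf of the example. The function names, the truncation point and the values of $c$ and $\alpha$ are assumptions for illustration only.

```python
def unbiased_zero_estimator(c, max_y):
    """Values u(-1), u(0), ..., u(max_y) generated by the second-order recursion."""
    u = {-1: -c, 0: 0, 1: c}
    for y in range(2, max_y + 1):
        u[y] = 2 * u[y - 1] - u[y - 2]   # second difference equal to zero
    return u


def pmf(x, alpha):
    """P[X = -1] = alpha and P[X = x] = (1 - alpha)^2 * alpha^x for x = 0, 1, 2, ..."""
    return alpha if x == -1 else (1 - alpha) ** 2 * alpha ** x


def expectation(u, alpha):
    """Truncated E[u(X)]; the neglected tail is negligible for the alpha used here."""
    return sum(u[x] * pmf(x, alpha) for x in u)


u = unbiased_zero_estimator(3, 400)
print(all(u[y] == 3 * y for y in u))   # the recursion reproduces u(y) = c * y
print(expectation(u, 0.4))             # numerically zero: a nonzero unbiased estimator of zero
```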

When the minimal sufficient statistic is not complete strange situations occur, as we now illustrate. Consider estimation of the indexing parameter $\alpha$. Then $T_1(X) = 1$ if $X = -1$ and zero otherwise is unbiased for $\alpha$ and $U_\alpha = \{T_1(x) + \lambda u_0 \mid \lambda \in R_1,\ u_0 \in U_0\}$. But as $U_0 = \{cX \mid c \in R_1\}$ we have $U_\alpha = \{T_1(x) + cX \mid c \in R_1\}$. $T^* \in U_\alpha$ is MVUE of $\alpha$ iff $\mathrm{Cov}(T^*, cX) = 0$, $\forall\ c \in R_1$. But we have $T^* = T_1(x) + c^* X$ for some constant $c^*$, and $\mathrm{Cov}(T^*, cX) = c^* c\, \mathrm{Var}(X) + c\, \mathrm{Cov}(X, T_1(X))$. We can show that $\mathrm{Cov}(X, T_1(X)) = -\alpha$ and $\mathrm{Var}(X) = \frac{2\alpha}{1-\alpha}$. Therefore $c^* c\, \mathrm{Var}(X) + c\, \mathrm{Cov}(X, T_1(X)) = 0$, $\forall\ c \in R_1$, holds iff $c^* = \frac{1-\alpha}{2}$. This choice of $c^*$ depends on $\alpha$, the unknown parameter, and is not the same for all values of $\alpha$. Thus although $U_\alpha$ is not empty, the MVUE of $\alpha$ does not exist.
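The dependence of $c^*$ on $\alpha$ can be seen numerically. The sketch below computes $\mathrm{Var}(X)$ and $\mathrm{Cov}(X, T_1(X))$ by truncated summation over the support and returns $c^* = -\mathrm{Cov}(X, T_1(X))/\mathrm{Var}(X)$; the function names, truncation point and values of $\alpha$ are illustrative assumptions.

```python
def pmf(x, alpha):
    """P[X = -1] = alpha and P[X = x] = (1 - alpha)^2 * alpha^x for x = 0, 1, 2, ..."""
    return alpha if x == -1 else (1 - alpha) ** 2 * alpha ** x


def c_star(alpha, max_x=2000):
    """Coefficient c* making Cov(T_1 + c* X, X) = 0 at this particular alpha."""
    support = range(-1, max_x + 1)
    var_x = sum(x * x * pmf(x, alpha) for x in support)   # E(X) = 0, so Var(X) = E(X^2)
    # T_1(X) is the indicator of X = -1, so E(X * T_1) = -alpha and Cov(X, T_1) = -alpha
    cov_x_t1 = sum(x * (1 if x == -1 else 0) * pmf(x, alpha) for x in support)
    return -cov_x_t1 / var_x


for alpha in (0.2, 0.6):
    print(alpha, c_star(alpha), (1 - alpha) / 2)
```

Because the two printed values of $c^*$ differ, no single statistic $T_1 + c^* X$ can be uncorrelated with every element of $U_0$ at every $\alpha$.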



On the other hand consider $\psi(\alpha) = (1-\alpha)^2 = P[X = 0]$. Then $T_2(X) = 1$ if $X = 0$ and zero otherwise belongs to $U_\psi$. Further for any $c \in R_1$ we have for any $\alpha$, $\mathrm{Cov}(T_2(X), cX) = 0$ and $T_2(X)$ is MVUE of $(1-\alpha)^2$.

Next we consider some important consequences of the Theorem 3.6.1, 
namely uniqueness and additivity of MVUE. 

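Theorem 3.6.2 Let $T_1 \in U_\psi$ be MVUE of $\psi(\theta)$; then $T_1$ is essentially unique.

Let $T_2 \in U_\psi$ be also MVUE for $\psi(\theta)$. Then $T_1 - T_2 \in U_0$ and $\mathrm{Cov}(T_1, T_1 - T_2) = 0 = \mathrm{Cov}(T_2, T_1 - T_2)$, i.e. $\mathrm{Cov}(T_1, T_2) = \mathrm{Var}(T_1) = \mathrm{Var}(T_2)$, $\forall\ \theta \in \Omega$. This implies that

$$\mathrm{Var}(T_1 - T_2) = \mathrm{Var}(T_1) - 2\,\mathrm{Cov}(T_1, T_2) + \mathrm{Var}(T_2) = 0, \quad \forall\ \theta \in \Omega$$

or $T_1 = T_2$ w.p. 1 for each $\theta \in \Omega$.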

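Theorem 3.6.3 Let $T_i$ be MVUE for $\psi_i(\theta)$, $i = 1, 2, \ldots, k$. Then $\sum_{i=1}^{k} a_i T_i$ is MVUE for $\psi(\theta) = \sum_{i=1}^{k} a_i \psi_i(\theta)$.

Let $u_0 \in U_0$; then $\mathrm{Cov}(T_i, u_0) = 0$, $i = 1, 2, \ldots, k$, $\forall\ \theta \in \Omega$. Therefore $\mathrm{Cov}(\sum a_i T_i, u_0) = \sum a_i\, \mathrm{Cov}(T_i, u_0) = 0$, $\forall\ \theta \in \Omega$, $\forall\ u_0 \in U_0$ and $\sum a_i T_i$ is MVUE of $\sum a_i \psi_i(\theta)$.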

Theorem 3.6.4 Let $\{L(x,\theta),\ \theta \in \Omega\}$ be such that $M$ is minimal sufficient and complete and $T(x)$ is any statistic such that $E(T(x))$ does not depend on $\theta$ and $\mathrm{Var}(T) = \sigma_T^2(\theta) < \infty$. Suppose that $\varphi(M)$ is a statistic with finite variance; then $\mathrm{Cov}(\varphi(M), T(x)) = 0$.

By the RBLS theorem $\varphi(M)$ is MVUE of $E_\theta[\varphi(M)]$ and $[T(x) - k_0] \in U_0$ where $k_0 = E_\theta(T(x))$, $\forall\ \theta \in \Omega$. Hence by Theorem 3.6.1 we have $\mathrm{Cov}[\varphi(M), T(x)] = 0$, $\forall\ \theta \in \Omega$.

We now consider some applications of the above theorems by way of 

examples. 

/ 

/ 

Example 3.6.2 Let $(X_1, \ldots, X_n)$ be a random sample of size $n$ from the double exponential (Laplace) distribution with zero mean and scale $\theta$. Then $f(x, \theta) = \frac{1}{2\theta}\exp\{-|x|/\theta\}$, $x \in R_1$, $\theta > 0$, belongs to the one parameter exponential family with $M = \sum_{i=1}^{n} |X_i|$ as minimal complete sufficient statistic for $\theta$, with

$$g_0(m, \theta) = \frac{1}{\Gamma(n)\theta^n}\, e^{-m/\theta}\, m^{n-1}, \quad m > 0,\ \theta > 0.$$

Suppose we want to estimate $\theta^r$. Then observing that $E(M^r) = \frac{\Gamma(n+r)}{\Gamma(n)}\theta^r$ we have that $\frac{\Gamma(n)}{\Gamma(n+r)} M^r$ is MVUE for $\theta^r$, $r = 1, 2, 3, \ldots$. Now by Theorem 3.6.3, $\sum_{r=1}^{k} a_r \frac{\Gamma(n)}{\Gamma(n+r)} M^r$ would be MVUE for $\sum_{r=1}^{k} a_r \theta^r$. We can thus obtain the MVUE of any polynomial of degree $k$ in $\theta$ using Theorem 3.6.3.

Observing that $E(X) = 0$ we have $T(l) = \sum_{i=1}^{n} l_i X_i \in U_0$ and by Theorem 3.6.4, $\mathrm{Cov}\left(\sum |X_i|, \sum l_i X_i\right) = 0$, i.e. $\sum |X_i|$ and $\sum l_i X_i$ would be uncorrelated. Note that $\mathrm{Var}(T(l)) = 2\sum l_i^2\, \theta^2$ depends on $\theta$ and the distribution of $T$ depends on $\theta$.

Since the distribution of $X$ is symmetric we can take any odd function $T_1$, i.e. one such that $T_1(-x) = -T_1(x)$, $x \in R_1$ (e.g. $\sin x$), with finite variance. Then Theorem 3.6.4 asserts that $\mathrm{Cov}(M, T_1(X_i)) = 0$ for each $i = 1, 2, \ldots, n$, or $\mathrm{Cov}\left(M, \sum l_i T_1(X_i)\right) = 0$. For example we can take $T_1(X_i) = X_i^{2r-1}$, $r \ge 1$, any odd power of $x$, to claim that $\mathrm{Cov}\left(M, \sum l_i X_i^{2r-1}\right) = 0$, $\forall\, \theta \in \Omega$. Taking $\varphi(M)$ to be any power of $M$, we have $\mathrm{Cov}\left(M^s, \sum_{i=1}^{n} l_i X_i^{2r-1}\right) = 0$, as $M^s$ would be MVUE for $E(M^s) = \frac{\Gamma(n+s)}{\Gamma(n)}\theta^s$. Using the first result of this example we have

$$\mathrm{Cov}\left[\sum_{s=1}^{k} a_s M^s,\ \sum_{i=1}^{n} l_i X_i^{2r-1}\right] = 0.$$
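The unbiasedness claim of Example 3.6.2 can be checked numerically. The following is only a sketch: the function names `mvue_theta_power` and `simulate_mean`, the seed and the replication count are illustrative choices, not part of the text. It uses the fact that $|X_i|$ is exponential with mean $\theta$ when $X_i$ is Laplace with scale $\theta$.

```python
import math
import random


def mvue_theta_power(m, n, r):
    """Gamma(n) * m**r / Gamma(n + r), where m is the observed sum of |x_i|."""
    return math.exp(math.lgamma(n) - math.lgamma(n + r)) * m ** r


def simulate_mean(theta, n, r, reps=20000, seed=1):
    """Monte Carlo average of the estimator; should be close to theta**r."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        # M = sum |X_i| is a sum of n exponentials with mean theta
        m = sum(rng.expovariate(1.0 / theta) for _ in range(n))
        total += mvue_theta_power(m, n, r)
    return total / reps
```

For instance, `simulate_mean(2.0, 5, 2)` should come out near $\theta^2 = 4$, up to simulation error.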


Example 3.6.3 Let $(X_1, \ldots, X_n)$ be i.i.d. $N(\theta, 1)$. Then $\bar{X}$ is minimal complete sufficient statistic. Let $e_i = X_i - \bar{X}$ be the $i$-th residual. Then $E(e_i) = 0$ and $e_i \in U_0$. Therefore $\mathrm{Cov}(\bar{X}, e_i) = 0$ for any $i = 1, 2, \ldots, n$. We note that not only $E(e_i) = 0$ but $e_i \sim N\left(0, \frac{n-1}{n}\right)$ and its distribution is independent of $\theta$.

Again, the joint distribution of $\bar{X}$ and $X_i - \bar{X}$ is bivariate normal with mean vector $(\theta, 0)'$ and covariance matrix

$$\begin{pmatrix} \dfrac{1}{n} & 0 \\ 0 & \dfrac{n-1}{n} \end{pmatrix}$$

and $\bar{X}$ and $X_i - \bar{X}$ are stochastically independent, a much stronger result than $\mathrm{Cov}(X_i - \bar{X}, \bar{X}) = 0$. Similarly let $S^2 = \sum (X_i - \bar{X})^2$. Then $S^2 \sim \chi^2_{n-1}$ and its distribution is independent of $\theta$, and therefore $\mathrm{Cov}(\bar{X}, S^2) = 0$. More generally, using the fact that $\varphi(\bar{X})$ is MVUE of its expectation and $\eta(S^2)$ has expectation independent of $\theta$, it follows that $\mathrm{Cov}(\varphi(\bar{X}), \eta(S^2)) = 0$. Of course we already know that $\bar{X}$ and $S^2$ are stochastically independent, and therefore $\varphi(\bar{X})$ and $\eta(S^2)$ would also be stochastically independent, implying thereby $\mathrm{Cov}(\varphi(\bar{X}), \eta(S^2)) = 0$.

The above example is an illustration of a theorem due to Basu (1955), which states the following.

Theorem 3.6.5 (Basu's Theorem). Let $M$ be minimal complete sufficient statistic for $\{L(x, \theta),\ \theta \in \Omega\}$ and let $T$ be a statistic such that the distribution of $T$ does not depend on $\theta$. Then $M$ and $T$ are stochastically independent.

Consider $P(T \le t) = F(t)$. Define an indicator function $I(x) = 1$ if $T(x) \le t$ and zero otherwise. Then $E_\theta(I(x)) = F(t)$ for all $\theta \in \Omega$. Now consider $\eta(m) = E_\theta[I(x) \mid M = m] = P(T(x) \le t \mid M = m)$, which does not depend on $\theta$ because $M$ is sufficient. Then $E_\theta(\eta(M)) = E_\theta(I(x)) = F(t)$, or $E_\theta(\eta(M) - F(t)) = 0$, $\forall\, \theta \in \Omega$. Since $M$ is complete this implies that $\eta(M) = F(t)$ w.p. 1 under each $\theta \in \Omega$. Therefore $P[T \le t \mid M = m] = P[T \le t]$ for each pair $t, m$ and for every $\theta \in \Omega$, and therefore $T$ and $M$ are independent.

A statistic $T$ whose distribution does not depend on $\theta$ is called an ancillary statistic, and Basu's theorem can be paraphrased as saying that if $M$ is a minimal sufficient and complete statistic and $T$ is ancillary then $T$ and $M$ are stochastically independent.
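As a rough numerical illustration of Basu's theorem in the setting of Example 3.6.3, one can simulate samples from $N(\theta, 1)$ and check that the sample correlation between $\bar{X}$ and $S^2$ is close to zero. This is a sketch only: the helper names, seed and sample sizes are hypothetical, and a near-zero correlation is merely consistent with independence and does not prove it.

```python
import random


def mean_and_ss(sample):
    """Return (sample mean, sum of squared deviations about the mean)."""
    n = len(sample)
    xbar = sum(sample) / n
    return xbar, sum((x - xbar) ** 2 for x in sample)


def sample_correlation(us, vs):
    """Pearson correlation of two equal-length lists."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    cov = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    var_u = sum((u - mu) ** 2 for u in us)
    var_v = sum((v - mv) ** 2 for v in vs)
    return cov / (var_u * var_v) ** 0.5


def basu_check(theta=3.0, n=5, reps=20000, seed=7):
    """Correlation between X-bar (complete sufficient) and S^2 (ancillary)."""
    rng = random.Random(seed)
    means, sums_sq = [], []
    for _ in range(reps):
        xbar, ss = mean_and_ss([rng.gauss(theta, 1.0) for _ in range(n)])
        means.append(xbar)
        sums_sq.append(ss)
    return sample_correlation(means, sums_sq)
```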


4

Simultaneous Estimation of Several Parameters


4.1 Optimality Criteria 

In this chapter we consider the problem of simultaneous estimation of a vector of $k$ functions $\psi(\theta) = (\psi_1(\theta), \ldots, \psi_k(\theta))'$, $k \ge 1$, where the functions $\{1, \psi_1, \ldots, \psi_k\}$ are linearly independent over the parameter space $\Omega$ and where $\theta = (\theta_1, \ldots, \theta_m)'$, $m \ge 1$, is itself a real or vector valued parameter.

The vector parametric function $\psi$ is estimated by a vector valued statistic $T = (T_1, \ldots, T_k)'$. The criterion of unbiasedness for a real valued function $\psi(\theta)$ considered in Chapter 3, i.e. the case $k = 1$, can be easily extended by defining a vector valued statistic $T$ to be unbiased for vector valued $\psi$ if

$$E(T_i) = \psi_i(\theta), \quad i = 1, 2, \ldots, k, \quad \forall\, \theta \in \Omega \tag{4.1.1}$$

Thus $T$ is unbiased for $\psi$ if it is componentwise unbiased. For example if we have a random sample of size $n$ from $N(\theta, \sigma^2)$ then as $E(\bar{X}) = \theta$ and $E(S^2/(n-1)) = \sigma^2$, where $S^2 = \sum (X_i - \bar{X})^2$, we have that $\left(\bar{X},\ \frac{1}{n-1}\sum (X_i - \bar{X})^2\right)'$ is unbiased for $(\theta, \sigma^2)'$. From the results of Chapter 3 we know that $\bar{X}$ is MVUE for $\theta$ and $S^2/(n-1)$ is MVUE for $\sigma^2$. However we must now define the optimality criterion of "Minimum Variance" for a vector valued statistic $T$ unbiased for a vector valued function $\psi$. Such a criterion must be related to the Variance Covariance matrix of $T$. Thus we assume that the estimator $T$ is such that its Variance Covariance matrix $M_T$ exists and is positive definite (pd). This is analogous to the requirement in the case $k = 1$ where we assumed that $T \in U_\psi$ is such that $\mathrm{Var}(T) = \sigma_T^2(\theta)$ exists and is positive. In the case $k = 1$, for $T_1, T_2 \in U_\psi$, $T_1$ is preferred to $T_2$ if $\mathrm{Var}(T_1) \le \mathrm{Var}(T_2)$, $\forall\, \theta \in \Omega$. In a similar way if $T_1 = (T_{11}, \ldots, T_{1k})'$ and $T_2 = (T_{21}, \ldots, T_{2k})'$ are both unbiased for $\psi(\theta) = (\psi_1(\theta), \ldots, \psi_k(\theta))'$, $T_1$ is preferred to $T_2$ if $M_{T_2} - M_{T_1}$ is non-negative definite (nnd) for every $\theta \in \Omega$.


Definition 4.1.1 $T^* \in U_\psi$ is MVUE for $\psi$ if $M_T - M_{T^*}$ is nnd $\forall\, \theta \in \Omega$, $\forall\, T \in U_\psi$.


We note the following important consequences of the above definition. Since $M_T - M_{T^*}$ is nnd it follows that for any $i$, $\mathrm{Var}(T_i^*) \le \mathrm{Var}(T_i)$, $\forall\, \theta \in \Omega$, where $T_i, T_i^* \in U_{\psi_i}$. Thus each $T_i^*$ is MVUE of $\psi_i(\theta)$, and the MVUness property of $T^*$ for $\psi$ as defined above implies componentwise MVUness of $T^*$ for $\psi$. We have already seen in Chapter 3 that if $T_i^*$, $i = 1, 2, \ldots, k$, are MVUE for $\psi_i$, $i = 1, 2, \ldots, k$, then any linear combination $T_a^* = \sum a_i T_i^*$ is MVUE of $\psi_a = \sum a_i \psi_i$. Therefore for any $T \in U_\psi$, writing $T_a = \sum a_i T_i$,

$$\mathrm{Var}(T_a^*) = a'M_{T^*}a \le \mathrm{Var}(T_a) = a'M_T a.$$

One can conversely define $T^*$ to be MVUE for $\psi$ if for any $a \in R^k$ and any $T \in U_\psi$, $a'M_T a \ge a'M_{T^*}a$, $\forall\, \theta \in \Omega$, which implies that $(M_T - M_{T^*})$ is nnd. Therefore an equivalent definition of MVUness of $T^*$ can be given as

Definition 4.1.2 $T^* \in U_\psi$ is MVUE for $\psi$ iff for any $a \in R^k$ and any $T \in U_\psi$, $a'M_{T^*}a \le a'M_T a$, $\forall\, \theta \in \Omega$.

We note that the definition of MVUness of $T^*$ for $\psi$ immediately implies that for any $T \in U_\psi$ and $\forall\, \theta \in \Omega$

(i) $|M_{T^*}| \le |M_T|$ (4.1.2)

(ii) $\mathrm{Tr}(M_{T^*}) \le \mathrm{Tr}(M_T)$ (4.1.3)

(iii) $\lambda_1(M_{T^*}) \le \lambda_1(M_T)$ (4.1.4)

where $\lambda_1(A)$ denotes the largest eigenvalue, or the largest characteristic root, of a positive definite matrix $A$, $|A|$ denotes its determinant and $\mathrm{Tr}(A)$ denotes the trace of $A$. Historically speaking, the optimality of a vector unbiased estimator $T$ was defined by Cramer (1946) using minimization of the determinant of the Variance-Covariance matrix $|M_T|$, which was later called the generalized variance. Cramer arrived at this criterion through the concept of an ellipsoid of concentration, which we shall consider next. The geometrical interpretation will also throw more light on the optimality criteria defined by (4.1.2), (4.1.3) and (4.1.4), which are termed D-optimality, T-optimality and E-optimality respectively. Definition 4.1.1, or equivalently Definition 4.1.2, of MVUness will be called M-optimality.
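The implication from M-optimality to the D, T and E criteria can be seen on a small numerical case. The sketch below works with $2 \times 2$ symmetric matrices only; the two matrices are made-up illustrations, not taken from the text, and the helper names are hypothetical.

```python
import math


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def trace2(m):
    return m[0][0] + m[1][1]


def largest_eig2(m):
    """Largest characteristic root of a symmetric 2x2 matrix."""
    half_tr = trace2(m) / 2.0
    return half_tr + math.sqrt(max(0.0, half_tr ** 2 - det2(m)))


def is_nnd2(m, tol=1e-12):
    """A symmetric 2x2 matrix is nnd iff its diagonal and determinant are >= 0."""
    return m[0][0] >= -tol and m[1][1] >= -tol and det2(m) >= -tol


def diff2(a, b):
    return [[a[i][j] - b[i][j] for j in range(2)] for i in range(2)]


# Hypothetical covariance matrices of two unbiased estimators.
M_T = [[2.0, 0.5], [0.5, 3.0]]
M_STAR = [[1.0, 0.2], [0.2, 2.0]]

m_preferred = is_nnd2(diff2(M_T, M_STAR))                # M criterion
d_preferred = det2(M_STAR) <= det2(M_T)                  # (4.1.2)
t_preferred = trace2(M_STAR) <= trace2(M_T)              # (4.1.3)
e_preferred = largest_eig2(M_STAR) <= largest_eig2(M_T)  # (4.1.4)
```

Here `M_T - M_STAR` is nnd, and all three scalar comparisons hold, as the text asserts they must.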

4.2 Ellipsoid of Concentration 

Let $T$ be a real statistic unbiased for a real parametric function $\psi(\theta)$ with variance $\sigma_T^2(\theta)$. Then as observed in Chapter 3, the MVUE $T^*$ of $\psi$ has the highest concentration of probability around $\psi$, as a consequence of Tchebychev's inequality. More formally, for each $T \in U_\psi$ consider a r.v. $W$ which is uniformly distributed over the interval $(\psi(\theta) - \sqrt{3}\,\sigma_T(\theta),\ \psi(\theta) + \sqrt{3}\,\sigma_T(\theta))$, so that $E(W) = \psi(\theta)$ and $\mathrm{Var}(W) = \sigma_T^2(\theta)$. Then $W^*$, the r.v. corresponding to the MVUE $T^*$, has the property that the length of the interval of its range satisfies $2\sqrt{3}\,\sigma_{T^*}(\theta) \le 2\sqrt{3}\,\sigma_T(\theta)$, where the right-hand side is the length of the interval of the range of $W$ corresponding to any other $T \in U_\psi$.




We now generalize this property for a $k$-dimensional vector valued statistic $T$ unbiased for a $k$-dimensional vector parametric function $\psi$. For each such $T \in U_\psi$ we consider a $k$-dimensional random vector $W$, which is uniformly distributed over a $k$-dimensional bounded connected set $S \subset R^k$ so that $E(W) = \psi$ and $M_W = M_T$, i.e. $W$ is centered at $\psi$ and has the same Variance Covariance matrix as that of $T$. However in $R^k$ there are many choices for such a region $S$, whereas in $R_1$ such a region must be an interval of the type $(\psi(\theta) - a,\ \psi(\theta) + a)$. In $R^2$ we could take a rectangle, a circle or an ellipse. The choice of a rectangle or a circle is not as suitable for manipulation as an ellipse, and therefore we take $S$ to be an ellipse with centre $\psi(\theta)$. Below we illustrate the algebra/calculus for the case $k = 2$, which can be generalized for $k \ge 3$ in a straightforward manner.


Let

$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{12} & a_{22} \end{pmatrix}$$

be a symmetric positive definite matrix and let $Y_1 = W_1 - \psi_1(\theta)$ and $Y_2 = W_2 - \psi_2(\theta)$. Then the r.v. $(Y_1, Y_2)'$ is assumed to be uniformly distributed over the region bounded by the ellipse defined by

$$a_{11}y_1^2 + 2a_{12}y_1y_2 + a_{22}y_2^2 = c^2 \quad \text{or} \quad y'Ay = c^2 \tag{4.2.1}$$

The elements of $A$ and the value of $c^2$ are to be chosen such that $M_Y = M_T$. Since $A$ is symmetric pd there exists an orthogonal matrix $P$ such that

$$PAP' = \Lambda = \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix}$$

and the corresponding linear transformation is $z = Py$, so that $y'Ay = z'\Lambda z$ and the ellipse given by (4.2.1) now becomes an ellipse in the standard form

$$\frac{\lambda_1 z_1^2}{c^2} + \frac{\lambda_2 z_2^2}{c^2} = 1 \tag{4.2.2}$$

Let $u_1 = \dfrac{\sqrt{\lambda_1}}{c}\, z_1$, $u_2 = \dfrac{\sqrt{\lambda_2}}{c}\, z_2$. Then (4.2.2) becomes the circle

$$u_1^2 + u_2^2 = 1 \tag{4.2.3}$$

Now as the area of the ellipse defined by (4.2.1) is $\dfrac{c^2\pi}{|A|^{1/2}}$, the joint pdf of $(Y_1, Y_2)'$ is given by

$$f(y_1, y_2) = \frac{|A|^{1/2}}{c^2\pi}, \quad y'Ay \le c^2.$$

By the rule of transformation and using the fact that $P$ is an orthogonal matrix, the joint pdf of $(Z_1, Z_2)'$ is given by

$$f(z_1, z_2) = \frac{|A|^{1/2}}{c^2\pi}, \quad \frac{\lambda_1 z_1^2}{c^2} + \frac{\lambda_2 z_2^2}{c^2} \le 1,$$

and that of $(U_1, U_2)'$ is given by

$$f(u_1, u_2) = \frac{1}{\pi} \quad \text{over the circle } u_1^2 + u_2^2 \le 1.$$

Note that $dz_1\,dz_2 = \dfrac{c^2}{\sqrt{\lambda_1\lambda_2}}\, du_1\,du_2$ and $\lambda_1\lambda_2 = |A|$.

Now we have $E(U) = 0$, which implies that $E(Z) = 0$ and $E(Y) = 0$. Further, $M_U = \frac{1}{4} I$, where $I$ is the identity matrix. Therefore $M_Z = \frac{c^2}{4}\Lambda^{-1} = \frac{c^2}{4} PA^{-1}P'$, since $P^{-1} = P'$. Now $M_Z = E(ZZ') = E(PYY'P') = PM_TP'$. Hence the matrix $A$ must be selected such that

$$\frac{c^2}{4}\, PA^{-1}P' = PM_TP' \tag{4.2.4}$$

In particular if we take $c^2 = 4$ we have $A^{-1} = M_T$ or $A = M_T^{-1}$, and the ellipse of concentration corresponding to $T$ given by (4.2.1) is

$$y'M_T^{-1}y = 4 \tag{4.2.5}$$
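The defining property of the ellipse of concentration (4.2.5), namely that a uniform distribution over $y'M_T^{-1}y \le 4$ has covariance matrix $M_T$, can be checked by simulation. The sketch below assumes a hypothetical $2 \times 2$ matrix and uses a Cholesky factor $L$ (with $M_T = LL'$) to map the unit disc onto the ellipse via $y = 2Lu$; this factorization is a convenient substitute for the orthogonal diagonalization used in the text, and all names are illustrative.

```python
import math
import random


def cholesky2(m):
    """Lower-triangular factor (l11, l21, l22) of a 2x2 pd matrix."""
    l11 = math.sqrt(m[0][0])
    l21 = m[1][0] / l11
    l22 = math.sqrt(m[1][1] - l21 ** 2)
    return l11, l21, l22


def uniform_unit_disc(rng):
    """Rejection sampling of a point uniform on u1^2 + u2^2 <= 1."""
    while True:
        u1, u2 = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        if u1 * u1 + u2 * u2 <= 1.0:
            return u1, u2


def empirical_cov(m, reps=100000, seed=11):
    """Second-moment matrix of Y uniform on {y : y' m^{-1} y <= 4} (mean is 0)."""
    rng = random.Random(seed)
    l11, l21, l22 = cholesky2(m)
    s11 = s12 = s22 = 0.0
    for _ in range(reps):
        u1, u2 = uniform_unit_disc(rng)
        y1 = 2.0 * l11 * u1                # y = 2 L u lies inside the ellipse
        y2 = 2.0 * (l21 * u1 + l22 * u2)
        s11 += y1 * y1
        s12 += y1 * y2
        s22 += y2 * y2
    return [[s11 / reps, s12 / reps], [s12 / reps, s22 / reps]]
```

Since a linear map has constant Jacobian, the image of a uniform point on the disc is uniform on the ellipse, and `empirical_cov([[2.0, 0.6], [0.6, 1.0]])` should reproduce the input matrix up to simulation error.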


For the $k$-dimensional case we proceed similarly and take the $k$-dimensional ellipsoid $y'Ay = c^2$, with $c > 0$. The elements of $A$ and $c$ are to be chosen such that $M_Y = M_T$. Define $Z = PY$ where $PAP' = \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_k)$ and take $U_i = \dfrac{\sqrt{\lambda_i}}{c} Z_i$, $i = 1, 2, \ldots, k$, to get the $k$-dimensional sphere $\sum_{i=1}^{k} u_i^2 = 1$. The distributions of $U$, $Z$ and $Y$ are obtained by using the formula for the volume of a sphere and an ellipsoid in $k$ dimensions, to obtain the pdfs as:

$$f(u_1, \ldots, u_k) = \frac{\Gamma\left(\frac{k}{2} + 1\right)}{\pi^{k/2}}, \quad \sum u_i^2 \le 1$$

$$f(z_1, \ldots, z_k) = \frac{|A|^{1/2}\,\Gamma\left(\frac{k}{2} + 1\right)}{\pi^{k/2} c^k}, \quad \sum \lambda_i z_i^2 \le c^2$$

$$f(y_1, \ldots, y_k) = \frac{|A|^{1/2}\,\Gamma\left(\frac{k}{2} + 1\right)}{\pi^{k/2} c^k}, \quad y'Ay \le c^2$$

Now by straightforward calculations we can show that $E(U) = 0$, which implies that $E(Z) = 0$ and $E(Y) = 0$, and $M_U = \dfrac{1}{k+2} I$. Further $M_Z = \dfrac{c^2}{k+2}\Lambda^{-1} = \dfrac{c^2}{k+2} PA^{-1}P'$, and therefore we should select $c > 0$ and $A$ such that

$$\frac{c^2}{k+2}\, PA^{-1}P' = PM_TP'.$$

Taking $c^2 = k + 2$, we have $A^{-1} = M_T$ or $A = M_T^{-1}$, and the ellipsoid of concentration corresponding to $T$ is

$$y'M_T^{-1}y = k + 2 \tag{4.2.6}$$

The volume of the ellipsoid of concentration of $T$ is

$$V_T = \frac{c^k \pi^{k/2}}{\Gamma\left(\frac{k}{2} + 1\right) |M_T^{-1}|^{1/2}}, \quad c^2 = k + 2,$$

which is proportional to $|M_T|^{1/2}$, and Cramer recommended the criterion of minimizing $V_T$ over $T \in U_\psi$. This is equivalent to minimizing $|M_T|$, the generalized variance of $T$, for $T \in U_\psi$, which leads to the D-optimal estimator $T \in U_\psi$.

Following the above approach of Cramer, Lehmann and Scheffe (1950) proposed a preference relation between $T_1$ and $T_2 \in U_\psi$. They recommended preferring $T_1$ to $T_2$ if for any $a \in R^k$ and $\forall\, \theta \in \Omega$

$$a'M_{T_1}^{-1}a \ge a'M_{T_2}^{-1}a \tag{4.2.7}$$

or $(M_{T_1}^{-1} - M_{T_2}^{-1})$ is nnd, which is equivalent to the criterion where we prefer $T_1$ to $T_2$ if $(M_{T_2} - M_{T_1})$ is nnd. This leads to the M-optimality criterion which we have adopted as Definition 4.1.1.

In a similar way Trace-optimality can be described geometrically as minimizing the sum of squares of the half axes of the ellipsoid of concentration, and E-optimality can be interpreted as minimizing the largest half axis of the ellipsoid of concentration. This is a consequence of the fact that the characteristic roots of $A = M_T^{-1}$, which are $(\lambda_1, \ldots, \lambda_k)$, are reciprocals of the characteristic roots of $M_T$, which are $(1/\lambda_1, \ldots, 1/\lambda_k)$.


4.3 Klebanov-Linnik-Rukhin Theorem 

Section 3.6 showed that a necessary and sufficient condition for a real statistic $T^* \in U_\psi$ to be MVUE of a real parametric function $\psi$ is that $\mathrm{Cov}(T^*, u) = 0$, $\forall\, u \in U_0$, $\forall\, \theta \in \Omega$. We now consider a generalization of this condition for a vector valued statistic $T^* \in U_\psi$ to be M-optimal for $\psi$, or equivalently MVUE of $\psi$.

Let $T = (T_1, \ldots, T_k)' \in U_\psi$ where $\psi = (\psi_1, \ldots, \psi_k)'$ and $\theta = (\theta_1, \ldots, \theta_m)'$ where $m \ge 1$. Let $U = (u_1, \ldots, u_k)' \in U_0^{(k)}$, defined by $E(U) = 0$, $\forall\, \theta \in \Omega$. Let $M_{TU}$ be the covariance matrix of $T$ and $U$, i.e. $M_{TU} = E[(T - \psi)U']$. Then the Klebanov-Linnik-Rukhin theorem states that

THEOREM 4.3.1 (K-L-R THEOREM). A necessary and sufficient condition for $T^* \in U_\psi$ to be M-optimal for $\psi$ is $M_{T^*U} = M_{UT^*} = 0$, $\forall\, U \in U_0^{(k)}$ and $\forall\, \theta \in \Omega$.

We have already seen that M-optimality of $T^*$ implies that for each $i = 1, \ldots, k$, $T_i^*$ is MVUE of $\psi_i$, and therefore for any $u_j$ such that $E(u_j) = 0$, $\forall\, \theta \in \Omega$, we have

$$\mathrm{Cov}(T_i^*, u_j) = 0, \quad \forall\, \theta \in \Omega,\ i = 1, 2, \ldots, k,\ j = 1, 2, \ldots, k \tag{4.3.1}$$

Therefore $M_{T^*U} = M_{UT^*} = 0$, $\forall\, U \in U_0^{(k)}$, $\forall\, \theta \in \Omega$.

On the other hand if $M_{T^*U} = 0 = M_{UT^*}$, consider $U = T - T^*$ where $T \in U_\psi$. Then

$$M_{T^*(T - T^*)} = 0 \quad \text{or} \quad M_{T^*T} = M_{T^*T^*} = M_{TT^*} = M_{T^*} \tag{4.3.2}$$

Let $M_{T^*-T}$ be the Variance Covariance matrix of $T^* - T$. Then

$$M_{T^*-T} = M_{T^*} - M_{T^*T} - M_{TT^*} + M_T = M_T - M_{T^*} \quad \text{in view of (4.3.2).}$$

But as $M_{T^*-T}$ is nnd, we have that $M_T - M_{T^*}$ is nnd for any $T \in U_\psi$ and $\forall\, \theta \in \Omega$, or $T^*$ is M-optimal for $\psi$.

Corollary 4.3.1 $T^*$ is M-optimal for $\psi$ if and only if $T_i^*$ is MVUE of $\psi_i$ for each $i = 1, 2, \ldots, k$.

Corollary 4.3.2 If $T_1^*$ and $T_2^*$ are M-optimal for $\psi_1$ and $\psi_2$, respectively, then $T^* = AT_1^* + BT_2^*$ is M-optimal for $A\psi_1 + B\psi_2$, where $A$, $B$ are real matrices of dimension $l \times k$.

Corollary 4.3.3 If $T_1$ and $T_2$ are both M-optimal for $\psi$ then $T_1 = T_2$ with probability one, $\forall\, \theta \in \Omega$.

As a consequence of Corollary 4.3.1 we can construct an M-optimal estimator for $\psi = (\psi_1, \ldots, \psi_k)'$ by first obtaining the MVUE $T_i^*$ of each $\psi_i$ using the methods given in Chapter 3 and then forming $T^* = (T_1^*, \ldots, T_k^*)'$, which will be M-optimal for $\psi$. Thus we use the Rao-Blackwell Lehmann-Scheffe Theorem, particularly in the case of multiparameter exponential families, to obtain M-optimal estimators of a vector of parametric functions of interest. We now consider a few examples to illustrate this procedure.

Example 4.3.1 Consider one way Analysis of Variance, where we have $k$ samples of sizes $n_i$, each from $N(\mu_i, \sigma^2)$, so that

$$X_{ij} = \mu_i + \varepsilon_{ij}, \quad j = 1, 2, \ldots, n_i,\ i = 1, 2, \ldots, k$$

where the $\varepsilon_{ij}$ are i.i.d. $N(0, \sigma^2)$. By straightforward calculations (which the reader should carry out) we can show that the joint pdf of the $N = \sum n_i$ r.v.s $X_{ij}$ is a $(k + 1)$ dimensional exponential family with $(T_1, T_2, \ldots, T_k, S^2)'$ as minimal sufficient and complete statistic, where

$$T_i = \bar{X}_i,\ i = 1, 2, \ldots, k \quad \text{and} \quad S^2 = \sum_{i} S_i^2, \quad S_i^2 = \sum_{j} (X_{ij} - \bar{X}_i)^2.$$

Note that $\bar{X}_i$, $i = 1, 2, \ldots, k$, and $S_i^2$, $i = 1, 2, \ldots, k$, are mutually independent. Further $\bar{X}_i \sim N(\mu_i, \sigma^2/n_i)$ and $S_i^2 \sim \sigma^2\chi^2_{n_i - 1}$. Therefore $\bar{X}_i$ is MVUE for $\mu_i$ for $i = 1, 2, \ldots, k$ and it follows that $(\bar{X}_1, \ldots, \bar{X}_k)'$, the vector of sample means, is the M-optimal or MVUE estimator of $(\mu_1, \ldots, \mu_k)'$, the vector of $k$ population means. Now $S^2 \sim \sigma^2\chi^2_{N-k}$ where $N = \sum n_i$, as the $S_i^2 \sim \sigma^2\chi^2_{n_i - 1}$ are mutually independent. Therefore $\hat{\sigma}^2 = \sum S_i^2/(N - k) = S^2/(N - k)$ is MVUE of $\sigma^2$. Suppose we want to estimate $\psi_{ij} = (\mu_i - \mu_j)/\sigma$, i.e. the difference between the $i$-th and the $j$-th population means measured in units of the population standard deviation $\sigma$. Then $c(\bar{X}_i - \bar{X}_j)/S$, where $c$ is chosen such that $E(c/S) = 1/\sigma$, would be MVUE of $\psi_{ij}$. Note that $E[(\bar{X}_i - \bar{X}_j)c/S] = E(\bar{X}_i - \bar{X}_j) \cdot E(c/S)$ because of mutual independence. Using the fact that $S^2 \sim \sigma^2\chi^2_{N-k}$ the value of $c$ can be determined, and we can show that

$$c = \sqrt{2}\ \Gamma\left(\frac{N-k}{2}\right) \Big/ \Gamma\left(\frac{N-k-1}{2}\right).$$

The vector $(\hat{\psi}_{12}, \hat{\psi}_{13}, \ldots, \hat{\psi}_{1k})'$, with $\hat{\psi}_{1j} = c(\bar{X}_1 - \bar{X}_j)/S$, is then the M-optimal estimator of $(\psi_{12}, \ldots, \psi_{1k})'$.
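The constant $c$ of Example 4.3.1 is easy to compute with log-gamma functions. The following sketch is illustrative only: the function names and the toy data in the usage note are hypothetical, and the code assumes $N - k \ge 2$ so that both gamma arguments are positive.

```python
import math


def scale_constant(nu):
    """c with E(c / S) = 1 / sigma when S**2 ~ sigma**2 * chi-square(nu), nu >= 2."""
    return math.sqrt(2.0) * math.exp(
        math.lgamma(nu / 2.0) - math.lgamma((nu - 1) / 2.0)
    )


def standardized_difference(samples, i, j):
    """c * (mean_i - mean_j) / S for a list of k samples (one list per group)."""
    means = [sum(s) / len(s) for s in samples]
    total_n = sum(len(s) for s in samples)
    k = len(samples)
    # pooled within-group sum of squares S^2
    ss = sum(sum((x - m) ** 2 for x in s) for s, m in zip(samples, means))
    return scale_constant(total_n - k) * (means[i] - means[j]) / math.sqrt(ss)
```

For two groups `[[1, 2, 3], [4, 5, 6]]` one has $N - k = 4$, $S^2 = 4$, and the estimate of $(\mu_1 - \mu_2)/\sigma$ is $-3\sqrt{2/\pi}$.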


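Example 4.3.2 (Chandrasekar, 1983). We consider now an application of the K-L-R Theorem to a case in which the class $U_0$ can be determined. Consider a single observation from $U(\alpha, \beta)$ where $\Omega$ is defined by $-\infty < \alpha < \alpha_0 < \beta_0 < \beta < \infty$. Then $X$ has pdf $f(x, \alpha, \beta) = 1/(\beta - \alpha)$, $\alpha < x < \beta$, over $\Omega$. Let $u(X)$ be an unbiased estimator of zero. Then

$$\int_\alpha^\beta u(x)\,dx = 0, \quad \forall\, (\alpha, \beta)' \in \Omega.$$

In particular, letting $\alpha \to \alpha_0$ and $\beta \to \beta_0$, $u(x)$ must be such that $\int_{\alpha_0}^{\beta_0} u(x)\,dx = 0$. Hence $\int_\alpha^{\alpha_0} u(x)\,dx = 0$, $\forall\, \alpha < \alpha_0$, which implies that $u(x) = 0$ for $x \le \alpha_0$, and $\int_{\beta_0}^{\beta} u(x)\,dx = 0$, $\forall\, \beta > \beta_0$, which implies that $u(x) = 0$ for $x \ge \beta_0$. The arguments here are the same as those indicated in the proof of completeness of $X_{(n)}$ for $U(0, \theta)$, $\theta > 0$. Thus $\int_\alpha^{\alpha_0} u(x)\,dx = 0$ implies that $\int_\alpha^{\alpha_0} u^+(x)\,dx = \int_\alpha^{\alpha_0} u^-(x)\,dx$ for all $\alpha < \alpha_0$, i.e. the measures assigned by $u^+(x)$ and $u^-(x)$ to any interval $(\alpha, \alpha_0)$ are the same, which implies that $u^+(x) = u^-(x)$ for $x < \alpha_0$, or $u(x) = 0$ for $x < \alpha_0$. Similarly we can show that $u(x) = 0$ for $x > \beta_0$. Thus

$$U_0 = \left\{ u(x) \ \Big|\ u(x) = 0,\ x \notin (\alpha_0, \beta_0),\ \int_{\alpha_0}^{\beta_0} u(x)\,dx = 0 \right\} \tag{4.3.4}$$

Now consider any estimator $T(x)$ such that it remains constant on $(\alpha_0, \beta_0)$. Then $\mathrm{Cov}(T(x), u) = 0$, $\forall\, u \in U_0$ and $\forall\, (\alpha, \beta) \in \Omega$, and therefore $T(x)$ is MVUE of $E(T(x))$. For example let

$$T_1(x) = \begin{cases} \dfrac{\alpha_0 + \beta_0}{2}, & x \in (\alpha_0, \beta_0) \\ x, & x \notin (\alpha_0, \beta_0) \end{cases}$$

Then

$$E(T_1(X)) = \frac{1}{\beta - \alpha}\left[\int_\alpha^{\alpha_0} x\,dx + \int_{\beta_0}^{\beta} x\,dx + \frac{(\alpha_0 + \beta_0)}{2}(\beta_0 - \alpha_0)\right] = \frac{1}{\beta - \alpha}\left[\frac{\alpha_0^2 - \alpha^2}{2} + \frac{\beta^2 - \beta_0^2}{2} + \frac{\beta_0^2 - \alpha_0^2}{2}\right] = \frac{\alpha + \beta}{2}.$$

Thus $T_1(x)$ is MVUE of $\dfrac{\alpha + \beta}{2}$, which is the mean of the distribution. Note that $X$ itself is unbiased for $\dfrac{\alpha + \beta}{2}$ but $X$ is not MVUE for $\dfrac{\alpha + \beta}{2}$. Similarly consider the $k$-th raw moment

$$\mu_k'(\alpha, \beta) = \frac{\beta^{k+1} - \alpha^{k+1}}{(\beta - \alpha)(k + 1)}.$$

Define $T_k(x) = c$ if $x \in (\alpha_0, \beta_0)$ and $x^k$ if $x \notin (\alpha_0, \beta_0)$. Then straightforward calculations show that

$$E(T_k(X)) = \frac{c(\beta_0 - \alpha_0)}{\beta - \alpha} + \frac{\alpha_0^{k+1} - \alpha^{k+1}}{(k+1)(\beta - \alpha)} + \frac{\beta^{k+1} - \beta_0^{k+1}}{(k+1)(\beta - \alpha)} = \frac{\beta^{k+1} - \alpha^{k+1}}{(k+1)(\beta - \alpha)} + \frac{1}{\beta - \alpha}\left[c(\beta_0 - \alpha_0) + \frac{\alpha_0^{k+1} - \beta_0^{k+1}}{k+1}\right].$$

If we take $c = \dfrac{\beta_0^{k+1} - \alpha_0^{k+1}}{(k+1)(\beta_0 - \alpha_0)}$ then

$$E(T_k(X)) = \mu_k'(\alpha, \beta).$$

As $\mathrm{Cov}(T_k(x), u) = 0$, $\forall\, u \in U_0$ and $(\alpha, \beta) \in \Omega$, each $T_k(x)$ is MVUE of $\mu_k'(\alpha, \beta)$. It now follows that $(T_1(x), \ldots, T_k(x))'$ would be M-optimal for the first $k$ moments $(\mu_1'(\alpha, \beta), \ldots, \mu_k'(\alpha, \beta))'$.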
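The estimator $T_1$ of Example 4.3.2 can be checked numerically for particular, arbitrarily chosen values of $\alpha_0, \beta_0, \alpha, \beta$. The sketch below uses a midpoint rule for the expectation under $U(\alpha, \beta)$; the helper names and the numerical values in the usage note are illustrative assumptions.

```python
def t1(x, a0, b0):
    """T_1: constant (a0 + b0)/2 on the open interval (a0, b0), identity outside."""
    return (a0 + b0) / 2.0 if a0 < x < b0 else x


def expectation_uniform(g, alpha, beta, steps=200000):
    """Midpoint-rule approximation of E g(X) for X ~ U(alpha, beta)."""
    h = (beta - alpha) / steps
    return sum(g(alpha + (i + 0.5) * h) for i in range(steps)) * h / (beta - alpha)


def variance_uniform(g, alpha, beta):
    """Approximate Var g(X) for X ~ U(alpha, beta)."""
    mean = expectation_uniform(g, alpha, beta)
    return expectation_uniform(lambda x: (g(x) - mean) ** 2, alpha, beta)
```

With $\alpha_0 = 0$, $\beta_0 = 1$, $\alpha = -1.5$, $\beta = 2.5$, both $X$ and $T_1(X)$ have mean $0.5$, while the variance of $T_1(X)$ is about $1.3125$ against $4/3$ for $X$.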

Example 4.3.3 A model that commonly occurs in modelling failure time distributions is described by the pdf

$$f(x, \mu, \sigma) = \frac{1}{\sigma}\exp\left\{-\frac{(x - \mu)}{\sigma}\right\}, \quad x \ge \mu,\ \mu > 0,\ \sigma > 0.$$

Here $\mu$ denotes the minimum life before which the item would not fail. The pdf also describes the prognosis of a disease, where $X$ would denote the serum level and $\mu$ is the threshold point. The model is a combination of the Pitman family and the Exponential family, since for $\mu$ known it becomes a one parameter exponential family and for $\sigma$ known it belongs to the Pitman family where the lower end depends on the parameter $\mu$.

As the r.v. $X$ is continuous, a sufficient statistic is given by the order statistic $(X_{(1)}, \ldots, X_{(n)})$, the failure times of the first failure, second failure etc. Using the one to one transformation given by $Y_1 = n(X_{(1)} - \mu)$, $Y_i = (n - i + 1)(X_{(i)} - X_{(i-1)})$, $i = 2, 3, \ldots, n$, we can show that $(Y_1, Y_2, \ldots, Y_n)$ are i.i.d. exponentials and

$$f(x_{(1)}, \ldots, x_{(n)}, \mu, \sigma) = \frac{n!}{\sigma^n}\exp\left\{-\frac{n(x_{(1)} - \mu)}{\sigma}\right\}\exp\left\{-\sum_{i=2}^{n} y_i/\sigma\right\}$$

and hence $X_{(1)}$ and $T = \sum_{i=2}^{n} Y_i = \sum_{i=2}^{n} (n - i + 1)(X_{(i)} - X_{(i-1)}) = \sum_{i=2}^{n} (X_{(i)} - X_{(1)})$ are independent with pdfs given by

$$g_1(x_{(1)}, \mu, \sigma) = \frac{n}{\sigma}\exp\left\{-\frac{n(x_{(1)} - \mu)}{\sigma}\right\}, \quad x_{(1)} > \mu$$

$$g_2(t, \sigma) = \frac{1}{\Gamma(n-1)\,\sigma^{n-1}}\, e^{-t/\sigma}\, t^{n-2}, \quad t > 0.$$

By the factorizability criterion it follows that $(X_{(1)}, T)'$ is sufficient for $(\mu, \sigma)'$, and one can show that it is minimal sufficient and complete. The proof is beyond our scope and we will therefore assume the result.

Suppose we want to obtain the M-optimal estimator of $(\mu, \sigma)'$. Then $\dfrac{T}{n-1}$ is MVUE of $\sigma$. Observing that $E(X_{(1)}) = \mu + \sigma/n$ we have that $X_{(1)} - \dfrac{T}{n(n-1)}$ is MVUE of $\mu$. Therefore the M-optimal estimator of $(\mu, \sigma)'$ is given by $\left(X_{(1)} - \dfrac{T}{n(n-1)},\ \dfrac{T}{n-1}\right)'$. The MVUE of $E(X) = \mu + \sigma$ is then given by

$$T_1 = X_{(1)} - \frac{T}{n(n-1)} + \frac{T}{n-1} = X_{(1)} + \frac{T}{n} = \left[nX_{(1)} + \sum (X_{(i)} - X_{(1)})\right]/n = \frac{1}{n}\sum_{i=1}^{n} X_{(i)} = \bar{X}.$$

Consider the reliability function, the probability that the item will not fail before time $t_0$:

$$R(t_0) = P[X > t_0] = \begin{cases} 1 & \text{if } t_0 < \mu \\ \exp\left\{-\dfrac{(t_0 - \mu)}{\sigma}\right\} & \text{if } t_0 \ge \mu \end{cases}$$

Then Laurent (1963) showed that by Rao-Blackwellizing w.r.t. $(X_{(1)}, T)$, starting with the estimator

$$T_1(X_1) = \begin{cases} 1 & \text{if } X_1 > t_0 \\ 0 & \text{otherwise} \end{cases}$$

the MVUE of $R(t_0)$ is given by

$$\hat{R}(t_0) = \begin{cases} 1 & \text{if } t_0 < x_{(1)} \\ \left(1 - \dfrac{1}{n}\right)\left[1 - \dfrac{t_0 - x_{(1)}}{\sum_{i=2}^{n}(x_{(i)} - x_{(1)})}\right]^{n-2} & \text{if } x_{(1)} \le t_0 < x_{(1)} + \sum_{i=2}^{n}(x_{(i)} - x_{(1)}) \\ 0 & \text{if } t_0 \ge x_{(1)} + \sum_{i=2}^{n}(x_{(i)} - x_{(1)}) \end{cases}$$

We have then the result that $(\hat{R}(t_1), \ldots, \hat{R}(t_k))'$ is M-optimal for the vector of reliability functions evaluated at $k$ points $(t_1, \ldots, t_k)$.
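The estimators of Example 4.3.3 are simple functions of the smallest observation and of $T = \sum (x_{(i)} - x_{(1)})$. The sketch below is an illustration with hypothetical function names. In particular, the factor $(1 - 1/n)$ in the middle branch of the reliability estimate is an assumption taken from the standard Rao-Blackwell derivation for this model; the printed formula is not fully legible in the source, so it should be checked against Laurent (1963).

```python
def exponential_mvue(sample):
    """Return (mu_hat, sigma_hat) for the two-parameter exponential model."""
    n = len(sample)
    x1 = min(sample)
    t = sum(x - x1 for x in sample)      # T = sum of (x_(i) - x_(1))
    sigma_hat = t / (n - 1)
    mu_hat = x1 - t / (n * (n - 1))
    return mu_hat, sigma_hat


def reliability_mvue(sample, t0):
    """Estimate of R(t0) = P[X > t0]; the (1 - 1/n) factor is an assumption."""
    n = len(sample)
    x1 = min(sample)
    t = sum(x - x1 for x in sample)
    if t0 < x1:
        return 1.0
    if t0 >= x1 + t:
        return 0.0
    return (1.0 - 1.0 / n) * (1.0 - (t0 - x1) / t) ** (n - 2)
```

For the sample `[2.0, 3.0, 5.0, 4.0]` this gives $\hat{\mu} = 1.5$ and $\hat{\sigma} = 2$, whose sum equals the sample mean $3.5$, in agreement with the identity $T_1 = \bar{X}$ derived in the example.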

Example 4.3.4 Let $(X_1, \ldots, X_n)$ be i.i.d. $N(\mu, \sigma^2)$. Then $(\bar{X}, S^2)'$ is complete sufficient for $(\mu, \sigma^2)'$ and $(\bar{X}, S^2/(n-1))'$ is M-optimal for $(\mu, \sigma^2)'$. Let $T_i(X_i) = 1$ if $X_i \le a_i$ and zero otherwise, so that $E[T_i(X_i)] = \Phi((a_i - \mu)/\sigma)$. Now for any $C$

$$P[X_i \le C \mid (\bar{X}, S^2)] = P\left[\frac{X_i - \bar{X}}{S} \le \frac{C - \bar{X}}{S} \ \Big|\ \bar{X}, S^2\right] = P[U \le u_0 \mid \bar{X}, S^2]$$

where $U = \sqrt{n}(X_i - \bar{X})/\sqrt{(n-1)S^2}$ and $u_0 = \sqrt{n}(C - \bar{X})/\sqrt{(n-1)S^2}$. The distribution of $U$ is independent of $(\bar{X}, S^2)$, in view of Theorem 3.6.5 (Basu's Theorem), and therefore $P[U \le u_0 \mid \bar{X}, S^2] = P[U \le u_0]$. The distribution of $U$ is known and its pdf is given by

$$g(u) = \frac{1}{\sqrt{\pi}}\,\frac{\Gamma((n-1)/2)}{\Gamma((n-2)/2)}\,(1 - u^2)^{(n-4)/2}, \quad |u| < 1,$$

from which $P[U \le u_0]$ can be obtained by using tables of the incomplete beta distribution, taking into account the sign of $u_0$ for the given observed value of $(\bar{X}, S^2)$ for the sample. The idea is to use the fact that $U^2$ is beta, $B(1/2, (n-2)/2)$, and use the result that

$$P[U \le u_0] = \begin{cases} \frac{1}{2} + \frac{1}{2} P[U^2 \le u_0^2] & \text{if } u_0 > 0 \\ \frac{1}{2} - \frac{1}{2} P[U^2 \le u_0^2] & \text{if } u_0 < 0 \end{cases}$$

Kolmogorov (1950) considered this problem in the context of quality control. Suppose that specifications indicate that the diameter of the head of a screw must be within $(1 \pm 0.01)$ mm. The proportion of acceptable screws in a given batch, when the process mean is $\mu$ mm and the process variance is $\sigma^2$ (mm)$^2$, is given by

$$\pi = \Phi\left(\frac{1.01 - \mu}{\sigma}\right) - \Phi\left(\frac{0.99 - \mu}{\sigma}\right)$$

and we can obtain the MVUE of $\pi$ using the above technique. The batch is then acceptable if $\pi \ge \pi_0 = 0.95$ say, i.e. percentage defective $\le 5\%$.

We point out here that the abovementioned paper of Kolmogorov defines sufficiency from the Bayesian viewpoint and also gives the Rao-Blackwell Theorem. In some Russian texts the Rao-Blackwell Theorem is referred to as the Kolmogorov-Rao-Blackwell Theorem.


Example 4.3.5 Let $(X_i, Y_i)$, $i = 1, 2, \ldots, n$, be i.i.d. bivariate discrete r.v. with p.m.f. given by

$$f(x, y, \lambda, p) = \binom{x}{y} p^y (1 - p)^{x-y}\, e^{-\lambda}\lambda^x/x!, \quad y = 0, 1, 2, \ldots, x,\ x = 0, 1, \ldots$$

Then as seen in Example 2.5.5, $V_1 = \sum X_i$ and $V_2 = \sum Y_i$ are jointly complete and sufficient for $(\lambda, p)'$. The conditional distribution of $V_2$ given $V_1$ is binomial and the marginal of $V_1$ is Poisson with mean $n\lambda$. Hence the joint pmf of $(V_1, V_2)'$ is

$$g(v_1, v_2, \lambda, p) = \binom{v_1}{v_2} p^{v_2}(1 - p)^{v_1 - v_2}\, e^{-n\lambda}(n\lambda)^{v_1}/v_1!, \quad v_2 = 0, 1, \ldots, v_1,\ v_1 = 0, 1, \ldots$$

As $E(V_2 \mid V_1) = V_1 p$ we have $E(V_2) = pE(V_1) = n\lambda p$. Hence $V_2/n$ is MVUE of $\lambda p$ and $V_1/n$ is MVUE of $\lambda$, and therefore $(V_1/n, V_2/n)'$ is M-optimal for $(\lambda, \lambda p)'$.

By a straightforward calculation we can show that $Y$ is Poisson with mean $\lambda p$ and hence $V_2$ is Poisson with mean $n\lambda p$. Consider estimation of $P[Y = 0] = e^{-\lambda p}$. This is the probability that no wrong calls are connected at the exchange if $X$ = the number of total telephone calls received in unit time and $Y$ = the number of calls wrongly connected. Using the results of Example 3.4.2 we have that

$$\varphi(V_1, V_2) = \left(\frac{n-1}{n}\right)^{V_2}, \quad V_2 = 0, 1, 2, \ldots$$

is the MVUE of $e^{-\lambda p}$.
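The unbiasedness of $\varphi(V_1, V_2) = ((n-1)/n)^{V_2}$ can be verified by summing against the Poisson pmf of $V_2$, which has mean $n\lambda p$. The sketch below truncates the series at a hypothetical number of terms, which is ample for moderate means; the function name is illustrative.

```python
import math


def expected_phi(n, lam, p, terms=200):
    """E[((n - 1) / n) ** V2] for V2 ~ Poisson(n * lam * p), by direct summation."""
    mean = n * lam * p
    pmf = math.exp(-mean)          # P[V2 = 0]
    total = 0.0
    for v in range(terms):
        total += ((n - 1) / n) ** v * pmf
        pmf *= mean / (v + 1)      # recursion P[V2 = v + 1] from P[V2 = v]
    return total
```

For example `expected_phi(5, 2.0, 0.3)` agrees with `math.exp(-0.6)` to within rounding error.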

Exercise 4.3 (1) Let $y_i = \alpha + \beta x_i + \varepsilon_i$, $i = 1, 2, \ldots, n$, be the normal regression model where the $\varepsilon_i$ are i.i.d. $N(0, \sigma^2)$ and where the $x_i$'s are constants with $\sum (x_i - \bar{x})^2 > 0$. As seen in Example 2.5.4 the joint pdf of $(y_1, \ldots, y_n)$ belongs to the 3 parameter exponential family. Determine the M-optimal estimator of $(\alpha, \beta, \sigma^2)'$, if it exists.

(2) Let $(X, Y)'$ be bivariate normal with parameters $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$. Show that the pdf of $(X, Y)'$ belongs to the five parameter exponential family. Find M-optimal estimators of (a) the mean vector $(\mu_1, \mu_2)'$ (b) the Variance-Covariance matrix

$$\begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$$

(3) Let $X_1$ and $X_2$ be independent exponentials with means $\theta_1$ and $\theta_2$. Show that the joint pdf of $(X_1, X_2)'$ belongs to the two parameter exponential family and obtain the M-optimal estimator of $(\theta_1, \theta_2, P(X_1 < X_2))'$.


4.4 Cramer-Rao Inequality 

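Let $X$ be a random vector with pdf belonging to the class $\{f(x, \theta),\ \theta \in \Omega \subset R^m\}$ which satisfies the regularity conditions given in Section 2.2, so that the Fisher information matrix $J(\theta)$ is well defined and is pd. Let $\psi = (\psi_1, \ldots, \psi_m)'$ be a vector of parametric functions such that the Jacobian $\dfrac{\partial(\psi_1, \ldots, \psi_m)}{\partial(\theta_1, \ldots, \theta_m)}$ is non-zero, or $\theta \to \psi$ is a one to one transformation. Let $D$ denote the matrix $D = (\partial\psi_i/\partial\theta_j) = (d_{ij})$, $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, m$. Let $T \in U_\psi$ be such that

$$\int T_i(x) f(x, \theta)\,dx = \psi_i, \quad i = 1, 2, \ldots, m$$

can be differentiated under the integral sign w.r.t. $\theta_j$, $j = 1, 2, \ldots, m$, so that

$$\int T_i(x)\,\frac{\partial \log f}{\partial\theta_j}\, f(x, \theta)\,dx = \frac{\partial\psi_i}{\partial\theta_j}$$

i.e.

$$\mathrm{Cov}(T_i, S_j) = d_{ij} \tag{4.4.1}$$

where $S = (S_1, \ldots, S_m)'$ is the vector of score functions $\left(\dfrac{\partial \log f}{\partial\theta_1}, \ldots, \dfrac{\partial \log f}{\partial\theta_m}\right)'$. Thus $D$ is the covariance matrix of $T$ and $S$. Let $u, v \in R^m$ and consider $Y = u'S$ and $Z = v'T$. Then $\mathrm{Cov}(Y, Z) = E(u'ST'v) = u'D'v$, $\mathrm{Var}(Y) = u'Ju$ and $\mathrm{Var}(Z) = v'M_Tv$. Then by the Cauchy Schwarz inequality

$$(u'D'v)^2 \le (u'Ju)(v'M_Tv) \tag{4.4.2}$$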




Let $u = J^{-1}D'v$. Then as $(J^{-1})' = J^{-1}$,

$$(v'DJ^{-1}D'v)^2 \le (v'DJ^{-1}JJ^{-1}D'v)\,(v'M_Tv) = (v'DJ^{-1}D'v)\,(v'M_Tv).$$

However since $J$ is pd and $D$ is non-singular we have $v'DJ^{-1}D'v > 0$ for $v \ne 0$, and thus $v'DJ^{-1}D'v \le v'M_Tv$, or $v'(M_T - DJ^{-1}D')v \ge 0$, $\forall\, v \in R^m$.

Thus $M_T - DJ^{-1}D'$ is nnd and we have the theorem,

THEOREM 4.4.1 Under the regularity conditions given above, for $T \in U_\psi$, $M_T - DJ^{-1}D'$ is nnd.

Corollary 4.4.1 For $\psi = (\theta_1, \ldots, \theta_m)'$ we have $D = I$ and $M_T - J^{-1}$ is nnd.

Remark 4.4.1 For the generalized variance of $T$, defined by $|M_T|$, we have $|M_T| \ge |D|^2/|J|$ and $\mathrm{Tr}(M_T) \ge \mathrm{Tr}(DJ^{-1}D')$. For $\psi = (\theta_1, \ldots, \theta_m)'$ we have $|M_T| \ge 1/|J|$ and $\mathrm{Tr}(M_T) \ge \sum_{i=1}^{m} J^{ii}(\theta)$, where $J^{ii}$ denotes the $(i, i)$-th diagonal element of the inverse of the Fisher information matrix $J$. Further for the $i$-th component of $T$

$$\mathrm{Var}(T_i) \ge \sum_{j=1}^{m}\sum_{l=1}^{m} d_{ij} J^{jl} d_{il}.$$

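Example 4.4.1 Let $\{X_i\}_1^n$ be i.i.d. $N(\theta_1, \theta_2)$ and let $\psi = (\theta_1, \theta_2)'$. Then as observed in Example 3.7.3, $\bar{X}$ is MVUE of $\theta_1$, $S^2/(n-1)$ is MVUE of $\theta_2$, and $T = (\bar{X}, S^2/(n-1))'$ is the M-optimal estimator of $(\theta_1, \theta_2)'$. But

$$M_T = \begin{pmatrix} \theta_2/n & 0 \\ 0 & 2\theta_2^2/(n-1) \end{pmatrix}$$

as $\bar{X} \sim N(\theta_1, \theta_2/n)$ and $S^2 \sim \theta_2\chi^2_{n-1}$, and $\bar{X}$ and $S^2$ are independent. Routine calculations will show that the Fisher information matrix is

$$J(\theta) = \begin{pmatrix} n/\theta_2 & 0 \\ 0 & n/2\theta_2^2 \end{pmatrix},$$

so that the CRLB to $M_T$ is given by

$$DJ^{-1}D' = \begin{pmatrix} \theta_2/n & 0 \\ 0 & 2\theta_2^2/n \end{pmatrix} \quad \text{and} \quad M_T - DJ^{-1}D' = \begin{pmatrix} 0 & 0 \\ 0 & 2\theta_2^2/(n(n-1)) \end{pmatrix}.$$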

Example 4.4.2 Consider the bivariate discrete distribution given in Example 4.3.5. Routine calculations show that

$$J(\lambda, p) = \begin{pmatrix} n/\lambda & 0 \\ 0 & n\lambda/p(1-p) \end{pmatrix}$$

and for $\psi = (\lambda, p)'$ we have that $M_T - \begin{pmatrix} \lambda/n & 0 \\ 0 & p(1-p)/n\lambda \end{pmatrix}$ is nnd by Theorem 4.4.1. However in this case we can show that $U_\psi$ is empty since, although unbiased estimators of $\lambda$ and $\lambda p$ exist, there does not exist an unbiased estimator of $p$. To prove this, if possible let $\varphi(v_1, v_2)$ be unbiased for $p$. Then for any $p \in (0, 1)$ and $\lambda > 0$ we have

$$\sum_{v_1=0}^{\infty}\sum_{v_2=0}^{v_1} \varphi(v_1, v_2)\, e^{-n\lambda}\frac{(n\lambda)^{v_1}}{v_1!}\binom{v_1}{v_2} p^{v_2}(1-p)^{v_1 - v_2} = p$$

or letting $n\lambda = \theta$ we have for any $p \in (0, 1)$ and $\theta > 0$

$$\sum_{v_1=0}^{\infty}\left[\sum_{v_2=0}^{v_1} \varphi(v_1, v_2)\binom{v_1}{v_2} p^{v_2}(1-p)^{v_1 - v_2}\right]\frac{\theta^{v_1}}{v_1!} = p\,e^{\theta} = p\sum_{v_1=0}^{\infty}\frac{\theta^{v_1}}{v_1!} \tag{4.4.3}$$

Compare the coefficient of $\theta^0$ on both sides of (4.4.3) for a fixed $p \in (0, 1)$. Then we must have for each fixed $p \in (0, 1)$

$$\varphi(0, 0) = p \tag{4.4.4}$$

This is a contradiction since $\varphi(0, 0)$ must take only a single value in $R_1$, and therefore $U_p$ is empty.

Remark 4.4.2 As observed in the case $k = 1$, the equality is attained in CRLB iff there exists a matrix $A(\theta)$ such that $\dfrac{\partial \log L}{\partial\theta} = A(\theta)[T - \psi(\theta)]$, where $T = (T_1, \ldots, T_m)'$, $\psi = (\psi_1, \ldots, \psi_m)'$ and $\dfrac{\partial \log L}{\partial\theta}$ is the vector of score functions $\left(\dfrac{\partial \log L}{\partial\theta_1}, \ldots, \dfrac{\partial \log L}{\partial\theta_m}\right)'$. We can show that in this case the pdf belongs to an $m$-parameter exponential family and $T$ is minimal sufficient for $\theta$.

In the above work we have assumed that the dimensionality of the parameter is the same as that of the vector of functions that we want to estimate. This need not be the case. Suppose $\theta = (\theta_1, \ldots, \theta_m)'$ and $\psi = (\psi_1(\theta), \ldots, \psi_k(\theta))'$ and let $D = (d_{ij})$ be the $k \times m$ matrix of $(\partial\psi_i/\partial\theta_j)$, $i = 1, 2, \ldots, k$, $j = 1, 2, \ldots, m$. Then the CRLB result states that $M_T - DJ^{-1}D'$ is nnd $\forall\, \theta \in \Omega \subset R^m$. The proof is exactly similar to that given in the case $k = m$. The case $k < m$ is usually of interest, which corresponds for example to estimating $\theta^{(1)} = (\theta_1, \ldots, \theta_k)'$ only while $(\theta_{k+1}, \ldots, \theta_m)'$ acts as a nuisance parameter. In this case the CRLB result states that $M_T - J^{11}$ is nnd, where $J^{-1}$ is partitioned as

$$J^{-1} = \begin{pmatrix} J^{11} & J^{12} \\ J^{21} & J^{22} \end{pmatrix}.$$

Consider Example 4.3.5 and let $\theta_1 = \lambda$ be the parameter of interest and $\theta_2 = p$ be a nuisance parameter. Then $J^{11} = \lambda/n$ and $\mathrm{Var}(T) \ge \lambda/n$ for any $T \in U_\lambda$, and CRLB is attained by $V_1/n = \bar{X}$. A similar situation occurs in Example 4.4.1, $N(\theta_1, \theta_2)$, where $\theta_1$ is the parameter of interest. In the same example if $\theta_2$ is the parameter of interest the CRLB, $J^{22}$, is given by $2\theta_2^2/n$ and there does not exist $T \in U_{\theta_2}$ which attains CRLB, although the MVUE of $\theta_2$ is given by $S^2/(n-1)$ as seen before. On the other hand in Example 4.3.5 an unbiased estimator of $p$ does not exist, and the question of attaining the CRLB $p(1-p)/n\lambda$ for estimating $p$ does not arise. In both of these examples $J$ is a diagonal matrix. This case is referred to as the parameters being orthogonal and we have $J^{11} = 1/J_{11}$. However if $J$ is not diagonal we have $J^{11} \ge 1/J_{11}$ and we have a sharper lower bound for the variance of $T_1 \in U_{\theta_1}$ when nuisance parameters $(\theta_2, \ldots, \theta_m)'$ are involved. We illustrate this by an example.

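Example 4.4.6 Let $(X_1, \ldots, X_{n-1})$ be i.i.d. exponential with mean $\theta > 0$ and
let $X_n$ be independent exponential with mean $\lambda\theta$ where $\lambda > 1$. This model
is used to describe a situation in which an identified outlier may be present
in the life testing experiment. Now the joint pdf is given by

\[ L(x, \theta, \lambda) = \frac{1}{\lambda\theta^n} \exp\left\{ -\frac{S_{n-1}}{\theta} - \frac{x_n}{\lambda\theta} \right\}, \quad \text{where } S_{n-1} = \sum_{i=1}^{n-1} x_i. \]

The joint pdf belongs to the two parameter exponential family with $(S_{n-1}, X_n)$
being complete sufficient for $(\theta, \lambda)$. Routine calculations will show that

\[ J = \begin{pmatrix} n/\theta^2 & 1/\lambda\theta \\ 1/\lambda\theta & 1/\lambda^2 \end{pmatrix} \quad \text{and} \quad J^{-1} = \begin{pmatrix} \theta^2/(n-1) & -\lambda\theta/(n-1) \\ -\lambda\theta/(n-1) & n\lambda^2/(n-1) \end{pmatrix}. \]

Let the parameter of interest be $\theta$; then $J^{11} = \theta^2/(n - 1)$ and

\[ \frac{1}{J_{11}} = \frac{\theta^2}{n} < J^{11} = \frac{\theta^2}{n-1}. \]

Now we can show that by the RBLS Theorem $S_{n-1}/(n-1)$ is the MVUE of $\theta$ with
variance $\theta^2/(n-1) = J^{11}$. On the other hand the CRLB for estimating $\lambda$ is
given by

\[ \frac{n\lambda^2}{n-1} = J^{22} > \frac{1}{J_{22}} = \lambda^2. \]

Using the fact that $S_{n-1}$ is Gamma with parameters $(n - 1)$ and $\theta$, an unbiased
estimator of $1/\theta$ is $(n-2)/S_{n-1}$. As $X_n$ and $S_{n-1}$ are independent,
$E[(n-2)X_n/S_{n-1}] = \lambda$ and $(n-2)X_n/S_{n-1}$ is the MVUE of $\lambda$, and thus

\[ \left( \frac{S_{n-1}}{n-1},\ \frac{(n-2)X_n}{S_{n-1}} \right)' \]

is M-optimal for $(\theta, \lambda)'$. We calculate, and particularly note, that
$J^{22} = n\lambda^2/(n-1)$ is smaller than

\[ \mathrm{Var}\left( \frac{(n-2)X_n}{S_{n-1}} \right) = \frac{\lambda^2(n-1)}{n-3}. \]

Further

\[ \mathrm{Cov}\left( \frac{S_{n-1}}{n-1},\ \frac{(n-2)X_n}{S_{n-1}} \right) = -\frac{\lambda\theta}{n-1} = J^{12}. \]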
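The bounds of Example 4.4.6 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `information_matrix` and the values of $n$, $\theta$ and $\lambda$ are illustrative choices, not taken from the text.

```python
import numpy as np

def information_matrix(n, theta, lam):
    """Fisher information J for (theta, lambda) in the one-outlier exponential model."""
    off_diagonal = 1.0 / (lam * theta)
    return np.array([[n / theta**2, off_diagonal],
                     [off_diagonal, 1.0 / lam**2]])

# Illustrative values (assumptions, not from the text).
n, theta, lam = 10, 2.0, 3.0

J = information_matrix(n, theta, lam)
J_inv = np.linalg.inv(J)

bound_theta = J_inv[0, 0]        # J^{11}: bound with lambda as nuisance parameter
bound_lambda = J_inv[1, 1]       # J^{22}: bound with theta as nuisance parameter
naive_theta = 1.0 / J[0, 0]      # 1/J_11: bound if lambda were known
naive_lambda = 1.0 / J[1, 1]     # 1/J_22: bound if theta were known

# Variance of the MVUE (n-2) X_n / S_{n-1} of lambda, as stated in the example.
var_mvue_lambda = lam**2 * (n - 1) / (n - 3)

print(f"theta : 1/J_11 = {naive_theta:.4f} < J^11 = {bound_theta:.4f}")
print(f"lambda: 1/J_22 = {naive_lambda:.4f} < J^22 = {bound_lambda:.4f} "
      f"< Var(MVUE) = {var_mvue_lambda:.4f}")
```

The numerical inverse agrees with the closed form $J^{-1}$ given above, and it shows the bound for $\lambda$ being strict: the MVUE of $\lambda$ does not attain $J^{22}$.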


4.5 Other Optimality Criteria 

In Sec. 4.1 we have defined the M, D, T and E optimality criteria and observed
that M-optimality implies D, T and E optimality. Now corresponding to each of
these criteria we have a preference order between $T_1$ and $T_2 \in U_\psi$. For
example we prefer $T_1$ to $T_2$ as per the D criterion if $|M_{T_1}| \le |M_{T_2}|$, $\forall\, \theta \in \Omega$.
Now if A and B are symmetric positive definite matrices then $|A| \le |B|$
does not imply that $(B - A)$ is nnd, and therefore if we prefer $T_1$ to $T_2$
according to the D criterion it does not follow that we should prefer $T_1$ to $T_2$
as per the M criterion. A similar remark holds for the T and E optimality
criteria. Thus although M-optimality of $T^*$ implies D, T and E optimality of
$T^*$, given that $T_1^*$ is D-optimal, say, it does not follow a priori that it would
be M, T or E-optimal, and in general we might have different optimal
estimators according to different optimality criteria.

The following theorem shows that fortunately this does not occur: if $T^*$ is
D-optimal then it is M-optimal and therefore it is T and E-optimal also.
Similarly, if $T^*$ is T-optimal then it is M-optimal and hence also D and
E-optimal. We show by an example that a similar result does not hold for an
E-optimal estimator, and E-optimality does not necessarily imply optimality
as per the other criteria. Thus the M, D and T criteria are equivalent in the
sense that they lead to the same optimal estimator, but the E criterion is
different.

THEOREM 4.5.1 (i) If $T^*$ is T-optimal then $T^*$ is M-optimal and hence
D and E optimal also.

(ii) If $T^*$ is D-optimal then $T^*$ is M-optimal and hence T and E optimal
also.

Proof Let $u \in U_0$, the class of unbiased estimators of zero, with
$\mathrm{Var}(u) = \sigma^2(\theta) > 0$. Suppose $T^*$ is not M-optimal. Then it follows that
there exists a subscript $i$ such that $\mathrm{Cov}(T_i^*, u) \ne 0$ at some $\theta_0 \in \Omega$.
Let $\beta_i(\theta) = \mathrm{Cov}(T_i^*, u)/\sigma^2(\theta)$ and let $\beta(\theta) = (\beta_1(\theta), \ldots, \beta_k(\theta))'$.
Then $\beta(\theta_0) \ne 0$. Define $T = T^* - \beta(\theta_0)u$ so that $T \in U_\psi$. Now at $\theta_0$

\[ M_T(\theta_0) = M_{T^*}(\theta_0) - \sigma^2(\theta_0)\,\beta(\theta_0)\beta'(\theta_0) \]

and therefore

\[ M_{T^*}(\theta_0) = M_T(\theta_0) + \sigma^2(\theta_0)\,\beta(\theta_0)\beta'(\theta_0) \qquad (4.5.1) \]

which implies that $\mathrm{Tr}[M_{T^*}(\theta_0)] > \mathrm{Tr}[M_T(\theta_0)]$ as $\beta(\theta_0) \ne 0$. This contradicts
the datum that $T^*$ is T-optimal. Hence by the reductio ad absurdum argument
$T^*$ must be M-optimal and therefore also D and E-optimal.

To prove (ii) we again argue as above. Suppose $T^*$ is not M-optimal; then
there exists a subscript $i$ such that $\mathrm{Cov}(T_i^*, u) \ne 0$ at some $\theta_0$, and thus if
$T = T^* - \beta(\theta_0)u$ we have, analogous to (4.5.1), the result that

\[ M_{T^*}(\theta_0) = M_T(\theta_0) + \sigma^2(\theta_0)\,\beta(\theta_0)\beta'(\theta_0) \qquad (4.5.2) \]

i.e. $A = B + C$

where A, B are pd matrices and C is nnd. By using the fact that the rank of C
is unity and that of B is $k$, one can show that $|B + C| > |B|$ by using
simultaneous diagonalization of B and C. Hence $|M_{T^*}(\theta_0)| > |M_T(\theta_0)|$,
which contradicts the datum that $T^*$ is D-optimal, and hence the result.

For more details we refer to Kale and Chandrasekar (1983), which contains
two distinct proofs of part (ii) of the above theorem.
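The two inequalities used in the proof, $\mathrm{Tr}(B + C) > \mathrm{Tr}(B)$ and $|B + C| > |B|$ for a pd matrix $B$ and a nonzero rank-one nnd matrix $C = \sigma^2\beta\beta'$, can be illustrated numerically. This is a sketch under assumed values (the dimension, seed, and $\sigma^2$ are arbitrary); it uses the standard matrix determinant lemma $|B + \sigma^2\beta\beta'| = |B|\,(1 + \sigma^2\beta'B^{-1}\beta)$ rather than the simultaneous diagonalization mentioned in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 3                                   # illustrative dimension (assumption)

# A random positive definite matrix B and a rank-one nnd matrix C.
A0 = rng.normal(size=(k, k))
B = A0 @ A0.T + k * np.eye(k)
beta = rng.normal(size=(k, 1))
sigma2 = 0.7                            # illustrative Var(u) (assumption)
C = sigma2 * (beta @ beta.T)

# Trace increases by sigma^2 * beta'beta > 0.
trace_gain = np.trace(B + C) - np.trace(B)

# Determinant ratio |B + C| / |B| and its closed form from the determinant lemma.
det_ratio = np.linalg.det(B + C) / np.linalg.det(B)
lemma_ratio = 1.0 + sigma2 * (beta.T @ np.linalg.solve(B, beta)).item()

print(f"trace gain      = {trace_gain:.6f}")
print(f"|B + C| / |B|   = {det_ratio:.6f}")
print(f"1 + s2 b'B^-1 b = {lemma_ratio:.6f}")
```

Since $B^{-1}$ is pd and $\beta \ne 0$, the closed-form ratio exceeds one, which is the strict inequality needed for part (ii).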

We now show by an example that E-optimality of $T_1$ does not imply M,
T or D optimality of $T_1$.


Example 4.5.1 [Kale and Chandrasekar (1983)]

Let $(X_1, \ldots, X_{10})$ be a random sample of size 10 from $N(\theta_1, \theta_2)$ where
the parameter space is $\Omega_0 = \{(\theta_1, \theta_2)' \mid \theta_1 \in R_1,\ \theta_2 \ge 5\}$. Then the pdf
belongs to the two parameter exponential family with $(\bar{X}, S^2)'$ as minimal
complete sufficient statistic for $(\theta_1, \theta_2)'$ even though the parameter space
$\Omega_0$ is a truncated subset of the natural parameter space $\Omega = R_1 \times (0, \infty)$.
Now by the RBLS theorem $\bar{X}$ and $S^2/9$ are the MVUE of $\theta_1$, $\theta_2$ respectively and
$T^* = (\bar{X}, S^2/9)'$ is M-optimal for $(\theta_1, \theta_2)'$ with

\[ M_{T^*} = \mathrm{diag}\left( \frac{\theta_2}{10},\ \frac{2\theta_2^2}{9} \right). \]

For $\theta \in \Omega_0$, $\lambda_1(M_{T^*}) = 2\theta_2^2/9$, where $\lambda_1(\cdot)$ denotes the largest
characteristic root. Now consider $T_1 = (X_1, S^2/9)'$, for which
$M_{T_1} = \mathrm{diag}(\theta_2,\ 2\theta_2^2/9)$ and $\lambda_1(M_{T_1}) = 2\theta_2^2/9 = \lambda_1(M_{T^*})$ as $\theta_2 \ge 5$.
Now we know that $T^*$ is M-optimal and therefore D, T and E optimal, and the
above argument shows that $T_1$ is also E-optimal since
$\lambda_1(M_{T_1}) = \lambda_1(M_{T^*}) \le \lambda_1(M_T)$ for any $T \in U_\psi$. Thus $T_1$ is E-optimal
but not M-optimal. Hence we conclude that although M-optimality implies
E-optimality, the converse is not true.


Remark 4.5.1 We observe that the D, T and E optimality criteria are
related to the characteristic roots $(\lambda_1, \ldots, \lambda_k)$ of the variance-covariance
matrix $M_T$. Thus the D criterion is based on minimizing $\prod_{i=1}^{k} \lambda_i$, T-optimality
is based on minimizing $\sum_{i=1}^{k} \lambda_i$ and E-optimality is based on minimizing
$\max_i \lambda_i$. One could consider other symmetric functions of $(\lambda_1, \ldots, \lambda_k)$ to
develop other criteria of optimality for a vector unbiased estimator T of $\psi$.
However, as observed in Section 4.2, T, D and E optimality have nice
geometric interpretations.



Consistent Estimators


5.1 Consistency 

This chapter deals with the estimation of a real or vector valued parametric
function $\psi(\theta)$, based on a random sample of size n, where n is assumed to
be large enough that, if T is an estimator of $\psi(\theta)$, its performance can
be studied by using the large sample or asymptotic distribution of T.
Corresponding to the criterion of unbiasedness in Chapter 3 we now consider
the criterion of consistency of an estimator T. We assume that T, a real
valued statistic, is to be used as an estimator of a real parameter $\theta$ based on
a random sample of size n from $\{f(x, \theta),\ \theta \in \Omega\}$ where $\Omega \subset R_1$.

Since the behaviour of T is to be studied for large values of n only, we
consider the sequence of r.v.s $\{T_n\}$ and base our study on its convergence
properties. There are several types of convergence that can be defined, e.g.
convergence in probability, convergence in distribution and convergence in
quadratic mean, among others. We refer to Cramer (1946) Chapter 20 Sec.
6, and Rao (1973) Chapter 2, Section 2-C for definitions, properties and
further details which are adequate for the purpose of this text, and it will
be assumed that the reader is familiar with the material, particularly with
the Weak Law of Large Numbers (WLLN) and the Central Limit Theorem
(CLT) for i.i.d. r.v.s.


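Definition 5.1.1 An estimator $T_n$ is said to be consistent for $\theta$ if
$T_n \xrightarrow{P} \theta$ for each $\theta \in \Omega$, where the convergence in probability is taken
under the distribution indexed by $\theta$.

Recalling the definition of $\xrightarrow{P}$, there are the following alternative ways in
which consistency can be defined.

Definition 5.1.2 (a) For any $\varepsilon > 0$

\[ \lim_{n \to \infty} P_\theta[\,|T_n - \theta| < \varepsilon\,] = 1, \quad \forall\, \theta \in \Omega \qquad (5.1.1) \]

(b) For any $\varepsilon > 0$

\[ \lim_{n \to \infty} P_\theta[\,|T_n - \theta| > \varepsilon\,] = 0, \quad \forall\, \theta \in \Omega \qquad (5.1.2) \]

(c) For any $\varepsilon > 0$ and $\delta > 0$ there exists an $n_0(\varepsilon, \delta, \theta)$ such that for
all $n \ge n_0(\varepsilon, \delta, \theta)$

\[ P_\theta[\,|T_n - \theta| < \varepsilon\,] > 1 - \delta, \quad \forall\, \theta \in \Omega \qquad (5.1.3) \]

or

\[ P_\theta[\,|T_n - \theta| > \varepsilon\,] < \delta, \quad \forall\, \theta \in \Omega \qquad (5.1.4) \]

All these definitions essentially convey the requirement that for large samples,
with very high probability, the values of the estimator $T_n$ are sufficiently
close to the "true value" of the parameter $\theta$. Indeed if $T_n$ is consistent for
$\theta$ and $\theta_1 \ne \theta_2$, then $P_{\theta_1}[\,|T_n - \theta_2| < \varepsilon\,]$ does not tend to unity, or
$P_{\theta_1}[\,|T_n - \theta_2| > \varepsilon\,]$ does not tend to zero, as $n \to \infty$. Thus $T_n$ does not
converge in probability to $\theta_2$ if the probability is taken under $\theta_1 \ne \theta_2$. This
is a consequence of the property that the probability limit is unique, i.e. if
$Y_n \xrightarrow{P} a$ then it cannot converge to another constant $b$.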

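A very important property of a consistent estimator is the invariance
under continuous transformation, a property not enjoyed by an unbiased
estimator. Thus if $\psi(\theta)$ is a continuous function and if T is consistent for $\theta$
then $\psi(T)$ is consistent for $\psi(\theta)$. As seen earlier, $E(T) = \theta$, $\forall\, \theta \in \Omega$ does
not in general imply that $E(\psi(T)) = \psi(\theta)$ unless $\psi$ is linear in $\theta$.


THEOREM 5.1.1 Let $T \xrightarrow{P} \theta$, $\forall\, \theta \in \Omega$, and let $\psi$ be a continuous function
from $\Omega$ to $R_1$; then $\psi(T) \xrightarrow{P} \psi(\theta)$.


As $\psi$ is continuous, given $\varepsilon > 0$ there exists a $\delta > 0$ such that
$|\psi(T) - \psi(\theta)| < \varepsilon$ whenever $|T - \theta| < \delta$. Let
$E_\varepsilon = \{x \mid |\psi(T(x)) - \psi(\theta)| < \varepsilon\}$ and $F_\delta = \{x \mid |T(x) - \theta| < \delta\}$.
Then $F_\delta \subset E_\varepsilon$ and $P_\theta(F_\delta) \le P_\theta(E_\varepsilon)$. But $\lim_{n \to \infty} P_\theta(F_\delta) = 1$,
$\forall\, \theta \in \Omega$ and any $\delta > 0$. Therefore $\lim_{n \to \infty} P_\theta(E_\varepsilon) = 1$, $\forall\, \theta \in \Omega$ and $\forall\, \varepsilon > 0$.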

As a consequence of this property of invariance of consistent estimators,
for all practical purposes we need consider consistent estimators of $\theta$ only.
Now suppose that the parameter $\theta$ is vector valued, say $\theta = (\theta_1, \ldots, \theta_m)'$.
Then a vector valued statistic $T = (T_1, \ldots, T_m)'$ is said to be consistent for
$\theta$ if

\[ T_i \xrightarrow{P} \theta_i, \quad i = 1, 2, \ldots, m, \quad \forall\, \theta \in \Omega \qquad (5.1.5) \]

This is analogous to defining a vector valued statistic T to be unbiased for a
vector valued function $\psi(\theta)$ if T is componentwise unbiased. However in
a vector space of m dimensions one could also define convergence by
introducing a distance function defined by a suitable norm. Thus we could
define T to be consistent for $\theta$ if

\[ \lim_{n \to \infty} P_\theta[\,\|T - \theta\| < \varepsilon\,] = 1, \quad \forall\, \theta \in \Omega,\ \forall\, \varepsilon > 0 \qquad (5.1.6) \]

Now we take $\|T - \theta\| = \max_i |T_i - \theta_i|$, i.e. the supnorm, which is
equivalent to the usual Euclidean distance, namely
$\|T - \theta\| = \{\sum_{i=1}^{m} (T_i - \theta_i)^2\}^{1/2}$. The definition of consistency induced by
(5.1.5) is usually referred to as marginal consistency whereas that induced by
(5.1.6) is called joint consistency. However, one can show that marginal and
joint consistency are equivalent.


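THEOREM 5.1.2 T is marginally consistent for $\theta$ iff T is jointly consistent
for $\theta$.

Now, assume that T is jointly consistent so that (5.1.6) holds, and let
$E_i = \{x \mid |T_i - \theta_i| < \varepsilon\}$ and $E = \{x \mid \max_i |T_i - \theta_i| < \varepsilon\}$. Then we have
$E = \bigcap_{i=1}^{m} E_i$ and therefore $E \subset E_i$ and $P_\theta(E) \le P_\theta(E_i)$. As T is jointly
consistent, for any $\varepsilon > 0$, $P_\theta(E) \to 1$ as $n \to \infty$, $\forall\, \theta \in \Omega$, and therefore for
each $i = 1, 2, \ldots, m$, $P_\theta(E_i) \to 1$ as $n \to \infty$, $\forall\, \theta \in \Omega$, so that each
$T_i \xrightarrow{P} \theta_i$, or T is marginally consistent for $\theta$.

Next assume that for each $i = 1, 2, \ldots, m$, $T_i \xrightarrow{P} \theta_i$; then by (5.1.2)
$P_\theta(E_i^c) \to 0$ as $n \to \infty$, $\forall\, \theta \in \Omega$. As $E = \bigcap_{i=1}^{m} E_i$, we have
$E^c = \bigcup_{i=1}^{m} E_i^c$ and $P_\theta(E^c) \le \sum_{i=1}^{m} P_\theta(E_i^c)$. But this implies that
$P_\theta(E^c) \to 0$ as $n \to \infty$, $\forall\, \theta \in \Omega$ and $\forall\, \varepsilon > 0$. Thus T is jointly
consistent for $\theta$.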

We can now generalize Theorem 5.1.1 to a situation in which T and $\theta$
are vector valued and $\psi$ is a continuous function from $\Omega$ to, say, $R^k$ where
k is not necessarily equal to m.


THEOREM 5.1.3 Let T be jointly consistent for $\theta$ and let $\psi$ be a
k-dimensional continuous function from $\Omega$ to $R^k$; then $\psi(T)$ is jointly
consistent for $\psi(\theta)$.

Since $\psi$ is a continuous function from $\Omega \subset R^m$ to $R^k$, given $\varepsilon > 0$ there
exists a $\delta > 0$ such that $\|\psi(T) - \psi(\theta)\| < \varepsilon$ whenever $\|T - \theta\| < \delta$.
Therefore $P_\theta[\,\|\psi(T) - \psi(\theta)\| < \varepsilon\,] \ge P_\theta[\,\|T - \theta\| < \delta\,]$, and as T is jointly
consistent, for any $\delta > 0$ and $\forall\, \theta \in \Omega$ we have
$\lim_{n \to \infty} P_\theta[\,\|T - \theta\| < \delta\,] = 1$, which implies that

\[ \lim_{n \to \infty} P_\theta[\,\|\psi(T) - \psi(\theta)\| < \varepsilon\,] = 1, \quad \forall\, \varepsilon > 0 \text{ and } \forall\, \theta \in \Omega. \]

Thus $\psi(T)$ is jointly consistent for $\psi(\theta)$ and therefore each $\psi_i(T)$ is consistent
for $\psi_i(\theta)$, $i = 1, 2, \ldots, k$.

We now consider some examples to illustrate the above theory.

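Example 5.1.1 Let $\{X_i\}_1^n$ be i.i.d. $N(\theta, 1)$ and consider the MVUE of $\theta$
given by $\bar{X}$. As $\{X_i\}_1^n$ are i.i.d. with $E(X_i) = \theta$, by using WLLN we have
$\bar{X} \xrightarrow{P} \theta$ and thus $\bar{X}$ is consistent for $\theta$. Note that WLLN does not
provide us any information about the rate of convergence of
$p_n(\varepsilon, \theta) = P_\theta[\,|\bar{X} - \theta| < \varepsilon\,] \to 1$. Noting that $\mathrm{Var}(\bar{X}) = 1/n$, by
Tchebychev's inequality

\[ p_n(\varepsilon, \theta) = P_\theta[\,|\bar{X} - \theta| < \varepsilon\,] \ge 1 - \frac{1}{n\varepsilon^2} \to 1, \quad \forall\, \theta \in \Omega. \]

We can then determine $n_0(\varepsilon, \delta, \theta)$ as needed in Definition 5.1.2(c), if we
select $n_0$ such that

\[ 1 - \frac{1}{n\varepsilon^2} \ge 1 - \delta \quad \text{or} \quad n \ge \frac{1}{\varepsilon^2\delta}. \]

Thus we can take $n_0 = \left[ \frac{1}{\varepsilon^2\delta} \right] + 1$, where $[u]$ denotes the largest integer
not exceeding a positive real number $u$. We observe that this $n_0(\varepsilon, \delta, \theta)$ does
not depend on $\theta$ and therefore $p_n(\varepsilon, \theta) \to 1$ uniformly in $\theta$. In this case we
say that $\bar{X}$ is uniformly consistent for $\theta$, or $\bar{X} \xrightarrow{P} \theta$ uniformly for $\theta \in \Omega$.
Next, since $\bar{X} \sim N(\theta, 1/n)$ we can obtain an exact value for

\[ p_n(\varepsilon, \theta) = \Phi(\sqrt{n}\,\varepsilon) - \Phi(-\sqrt{n}\,\varepsilon) = 2\Phi(\sqrt{n}\,\varepsilon) - 1, \]

and as $\Phi(\sqrt{n}\,\varepsilon) \to 1$ as $n \to \infty$, $p_n(\varepsilon, \theta) \to 1$ as $n \to \infty$. Further, to
determine $n_0(\varepsilon, \delta, \theta)$ we must select $n_0$ such that

\[ 2\Phi(\sqrt{n}\,\varepsilon) - 1 \ge 1 - \delta \quad \text{or} \quad \Phi(\sqrt{n}\,\varepsilon) \ge 1 - \delta/2 \]

or

\[ n \ge \frac{1}{\varepsilon^2}\left\{ \Phi^{-1}[1 - \delta/2] \right\}^2 \quad \text{or} \quad n_0 = \left[ \frac{1}{\varepsilon^2}\left\{ \Phi^{-1}[1 - \delta/2] \right\}^2 \right] + 1. \]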

As is expected, the values of $n_0$ determined by the Tchebychev inequality would
be much larger than those determined by using the exact distribution
$\bar{X} \sim N(\theta, 1/n)$. Thus for $\varepsilon = 10^{-1}, 10^{-2}$ and $\delta = 10^{-1}, 10^{-2}$ we get the
following data:

                 n_0 (Tchebychev inequality)      n_0 (Exact distribution)
    delta \ eps      0.1          0.01               0.1         0.01
    0.1            10^3 + 1     10^5 + 1             385        38417
    0.01           10^4 + 1     10^6 + 1             790        78962

The practical meaning of $n_0(\varepsilon, \delta, \theta)$ is that it gives the minimum sample
size required to attain the level of accuracy specified by the $(\varepsilon, \delta)$ combination.
Thus the minimum sample size required to estimate the mean $\theta$ in the $N(\theta, 1)$
case correct to the first place of decimal ($\varepsilon = 10^{-1}$) with probability
$p_n(\varepsilon, \theta) \ge .99$ (i.e. $\delta = 0.01$) would be 38417. Note that $n_0$ determined by using
the Tchebychev inequality will be $10^5 + 1$, or over a hundred thousand.
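The two sample-size rules of Example 5.1.1 are easy to evaluate directly. The sketch below assumes only the Python standard library; the function names `n0_chebyshev` and `n0_exact` are illustrative. It evaluates the formulas exactly as stated, $[1/(\varepsilon^2\delta)] + 1$ and $[\{\Phi^{-1}(1 - \delta/2)\}^2/\varepsilon^2] + 1$. The printed exact-distribution entries 385 and 38417 coincide with the second formula evaluated at the normal quantile 1.96, and 790 and 78962 with the quantile 2.81, so the values computed from $\Phi^{-1}(1 - \delta/2)$ come out somewhat smaller than the printed ones.

```python
import math
from fractions import Fraction
from statistics import NormalDist

def n0_chebyshev(eps, delta):
    """[1 / (eps^2 * delta)] + 1, using exact rational arithmetic.

    eps and delta are passed as decimal strings to avoid floating-point
    round-off in the floor (e.g. "0.1", "0.01").
    """
    eps, delta = Fraction(eps), Fraction(delta)
    return math.floor(1 / (eps**2 * delta)) + 1

def n0_exact(eps, delta):
    """[ (Phi^{-1}(1 - delta/2) / eps)^2 ] + 1, based on X-bar ~ N(theta, 1/n)."""
    z = NormalDist().inv_cdf(1.0 - float(Fraction(delta)) / 2.0)
    return math.floor((z / float(Fraction(eps))) ** 2) + 1

for delta in ("0.1", "0.01"):
    for eps in ("0.1", "0.01"):
        print(f"delta={delta:>4}, eps={eps:>4}: "
              f"Tchebychev n0 = {n0_chebyshev(eps, delta):>8}, "
              f"exact n0 = {n0_exact(eps, delta):>6}")
```

Either way the qualitative conclusion of the example stands: the Tchebychev bound is distribution free and therefore much more conservative than the sample size obtained from the exact normal distribution of $\bar{X}$.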


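Example 5.1.2 Let $\{X_i\}_1^n$ be i.i.d. $N(\theta_1, \theta_2)$; then by WLLN applied to
$\{X_i\}$ and $\{X_i^2\}$ we have $m_1' = \bar{X} \xrightarrow{P} \theta_1$ and
$m_2' = \frac{1}{n}\sum X_i^2 \xrightarrow{P} \theta_2 + \theta_1^2$. Now by invariance under continuous
transformation $\bar{X}^2 \xrightarrow{P} \theta_1^2$ and $(\bar{X}^2,\ \frac{1}{n}\sum X_i^2)'$ is consistent for
$(\theta_1^2,\ \theta_2 + \theta_1^2)'$ and, therefore,

\[ \frac{1}{n}\sum X_i^2 - \bar{X}^2 = \frac{S^2}{n} = \frac{1}{n}\sum (X_i - \bar{X})^2 \xrightarrow{P} \theta_2. \]

Note that while $\bar{X}$ is MVUE for $\theta_1$, $\bar{X}^2$ is not MVUE for $\theta_1^2$ and $S^2/n$ is not
MVUE for $\theta_2$. The MVUE of $\theta_2$ is

\[ \frac{S^2}{n-1} = \left( \frac{n}{n-1} \right) \frac{S^2}{n} \xrightarrow{P} \theta_2 \quad \text{as} \quad \frac{S^2}{n} \xrightarrow{P} \theta_2 \ \text{and} \ \frac{n}{n-1} \to 1. \]

This follows from the property that if $Y_n \xrightarrow{P} \alpha$ and $\{a_n\}$ and $\{b_n\}$ are real
sequences converging to $a$ and $b$ respectively then $a_n Y_n + b_n \xrightarrow{P} a\alpha + b$.
We also note that the standard deviation $\sqrt{\theta_2}$, the positive root of $\theta_2$, is
consistently estimated by $\sqrt{S^2/n} = S/\sqrt{n}$. Note that in general taking the
square root is not a function, but if the domain and range are restricted to the
positive axes we have in fact a continuous function.

Consider now estimating the p-th percentile of $N(\theta_1, \theta_2)$ given by
$\theta_1 + \xi_p\sqrt{\theta_2}$, where $\xi_p$ is the p-th percentile of $N(0, 1)$, which is a continuous
function of $(\theta_1, \theta_2)'$. Then $\bar{X} + \xi_p S/\sqrt{n}$ is a consistent estimator of the
100p% point of $N(\theta_1, \theta_2)$. On the other hand suppose we want to estimate
$\psi_a(\theta_1, \theta_2) = P[X \le a] = \Phi\!\left( \frac{a - \theta_1}{\sqrt{\theta_2}} \right)$; then as $\Phi$ is a continuous
function of $(\theta_1, \theta_2)'$ we have that $\Phi\!\left( \frac{a - \bar{X}}{S/\sqrt{n}} \right)$ is consistent for $\psi_a(\theta_1, \theta_2)$.
We note the ease with which we can obtain consistent estimators as opposed
to MVU estimators of these functions.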
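The plug-in estimator of $P[X \le a]$ from Example 5.1.2 can be watched converging in a small simulation. This is only a sketch, assuming NumPy; the parameter values, the seed, the sample sizes and the helper name `plug_in` are illustrative assumptions.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
theta1, theta2, a = 1.0, 4.0, 2.0        # illustrative mean, variance, cut-off
target = NormalDist().cdf((a - theta1) / np.sqrt(theta2))

def plug_in(x, a):
    """Phi((a - xbar) / sqrt(S^2/n)): continuous function of consistent estimators."""
    xbar = x.mean()
    sd = np.sqrt(((x - xbar) ** 2).mean())    # sqrt(S^2 / n)
    return NormalDist().cdf((a - xbar) / sd)

errors = {}
for n in (100, 10_000, 1_000_000):
    x = rng.normal(theta1, np.sqrt(theta2), size=n)
    errors[n] = abs(plug_in(x, a) - target)
    print(f"n = {n:>9}: |estimate - target| = {errors[n]:.5f}")
```

No unbiasedness calculation is needed here; consistency follows from Theorem 5.1.3 alone, which is the point made at the end of the example.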

Example 5.1.3 Let $\{(X_i, Y_i)\}_1^n$ be i.i.d. Bivariate Normal with mean vector
$(\mu_1, \mu_2)'$ and covariance matrix

\[ \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}. \]

Then $\{X_i\}_1^n$ being i.i.d. $N(\mu_1, \sigma_1^2)$, we have from Example 5.1.2

\[ \bar{X} \xrightarrow{P} \mu_1, \qquad \frac{S_x^2}{n} = \sum (X_i - \bar{X})^2/n \xrightarrow{P} \sigma_1^2. \]

Similarly $\bar{Y} \xrightarrow{P} \mu_2$ and $\frac{S_y^2}{n} \xrightarrow{P} \sigma_2^2$. To estimate the covariance term let

\[ \frac{S_{xy}}{n} = \frac{1}{n}\sum (X_i - \bar{X})(Y_i - \bar{Y}) = \frac{1}{n}\sum X_i Y_i - \bar{X}\bar{Y}. \]

Now $\frac{1}{n}\sum X_i Y_i \xrightarrow{P} E(XY)$ and as $\bar{X} \xrightarrow{P} \mu_1$, $\bar{Y} \xrightarrow{P} \mu_2$, we have
$\frac{S_{xy}}{n} \xrightarrow{P} E(XY) - E(X)E(Y) = \rho\sigma_1\sigma_2$. Now $\rho\sigma_1\sigma_2/\sigma_1\sigma_2 = \rho$ and
therefore $S_{xy}/S_x S_y$ is consistent for $\rho$.

The reader can easily generalize this example to the case where
$(X_1, \ldots, X_k)'$ is multivariate normal with mean vector $(\mu_1, \ldots, \mu_k)'$ and
variance covariance matrix $\Lambda$. Thus it is easy to show that the sample mean
vector $(\bar{X}_1, \ldots, \bar{X}_k)'$ is consistent for $(\mu_1, \ldots, \mu_k)'$ and $\Lambda$ is consistently
estimated by the sample variance covariance matrix S where

\[ S_{ij} = \frac{1}{n}\sum_{r=1}^{n} X_{ir}X_{jr} - \bar{X}_i\bar{X}_j, \quad i = 1, 2, \ldots, k,\ j = 1, 2, \ldots, k. \]

Once the above result is established we can conclude that the population
multiple correlation coefficient $\rho^2_{1.23 \ldots k} = 1 - |\Lambda|/\lambda_{11}|\Lambda_{(11)}|$ is estimated
consistently by the sample multiple correlation coefficient
$R^2_{1.23 \ldots k} = 1 - |S|/s_{11}|S_{(11)}|$. Using similar techniques we can claim that the
population partial correlation coefficients and regression coefficients are
consistently estimated by their sample counterparts.

Example 5.1.4 Let $\{X_i\}_1^n$ be i.i.d. Then we know that $\bar{X} \xrightarrow{P} \mu$ iff $E(X) = \mu$.
Thus if we have a random sample of size n from a population for
which $E(X)$ does not exist then $\bar{X}$, the sample mean, would not be consistent
for the population mean. Such a situation obtains for the Cauchy distribution
with location parameter $\mu$, having pdf

\[ f(x, \mu) = \frac{1}{\pi}\,\frac{1}{1 + (x - \mu)^2}, \quad x \in R_1,\ \mu \in R_1, \]

and $\bar{X}$ does not converge in probability to $\mu$. Indeed we know that $\bar{X}$ has the
same pdf as that of a single observation and
$P[\,|\bar{X} - \mu| < \varepsilon\,] = \frac{2}{\pi}\tan^{-1}(\varepsilon)$, which does not tend to one as $n \to \infty$.

In the case of the Pareto distribution with pdf $f(x, \lambda) = \frac{\lambda}{x^{\lambda+1}}$, $x > 1$, which is
often used as a model to describe income distribution, $E(X)$ exists only if
$\lambda > 1$ and then it is given by $\frac{\lambda}{\lambda - 1}$. Therefore $\bar{X} \xrightarrow{P} \frac{\lambda}{\lambda - 1}$ only when
$\lambda > 1$, but for $\lambda \le 1$, $\bar{X}$ will not converge in probability.

In the next Section we will consider the method of moments in general
and the quantile or percentile method of estimation in Section 5.3.

Exercise 5.1 


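1. Let $\{X_i\}_1^n$ be i.i.d. Gamma, $G(\lambda, \sigma)$, with p.d.f.
$f(x, \sigma, \lambda) = \frac{1}{\Gamma(\lambda)\sigma^\lambda}\, e^{-x/\sigma} x^{\lambda - 1}$, $x > 0$, $\sigma > 0$, $\lambda > 0$. Using WLLN for
$\{X_i\}$ and $\{X_i^2\}$ show that $m_1' \xrightarrow{P} \lambda\sigma$ and $m_2' \xrightarrow{P} \lambda(\lambda + 1)\sigma^2$ and therefore
$m_2 = m_2' - (m_1')^2 \xrightarrow{P} \lambda\sigma^2$, where $m_k'$ and $m_k$ denote the k-th raw and central
moments of the sample. Show that $\left( \frac{m_2}{m_1'},\ \frac{m_1'^2}{m_2} \right)'$ is consistent for $(\sigma, \lambda)'$.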


2. Let $\{X_i\}_1^n$ be i.i.d. $b(1, \theta)$; show that $\bar{X} \xrightarrow{P} \theta$. Using this fact obtain
(i) a consistent estimator of the probability that in future m trials s
successes occur, (ii) a consistent estimator of $\theta(1 - \theta)$, the population
variance, and (iii) a consistent estimator of $\frac{1}{\theta(1 - \theta)} = I_X(\theta)$, the Fisher
Information per unit observation.

3. Let $\{X_i\}_1^n$ be i.i.d. exponential with mean $\sigma$, so that $\bar{X} \xrightarrow{P} \sigma$. Let
$T = \sum_{i=1}^{n} l_i X_i$ be consistent for $\sigma$. Show that $E(T - \sigma)^2$ is minimized for
$l_1 = l_2 = \ldots = l_n = \frac{1}{n+1}$. Consider $T^* = \frac{\sum X_i}{n+1}$. Show that $T^* \xrightarrow{P} \sigma$ and
MSE $(T^*)$ < MSE $(\bar{X})$ although $\bar{X}$ is MVUE of $\sigma$. Obtain a consistent
estimator of $R(a, \sigma) = e^{-a/\sigma} = P_\sigma(X > a)$.

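4. Let $\{X_i\}_1^n$ be i.i.d. Poisson with mean $\lambda$ so that $\bar{X} \xrightarrow{P} \lambda$ and
$e^{-\bar{X}} \xrightarrow{P} e^{-\lambda}$. Obtain $E(e^{-\bar{X}})$ and Var $(e^{-\bar{X}})$ and compare these with the
corresponding expressions for the MVUE of $e^{-\lambda}$ given by
$\varphi(T) = \left( \frac{n-1}{n} \right)^T$.

(Hint: $T = \sum X_i \sim$ Poisson $(n\lambda)$)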


5. We have seen that in the $U(0, \theta)$ case, $T = \frac{n+1}{n} X_{(n)}$ is MVUE. Show that T is
consistent and so also is $\frac{n}{n+1} T = X_{(n)}$. Compare MSE$(T)$ and
MSE$\left( \frac{n}{n+1} T \right)$. Show that $X_{(n-1)}$ is consistent for $\theta$ but $X_{(1)}$ is not.

(Hint: To prove that $X_{(1)}$ is not consistent calculate $P_\theta[\,|X_{(1)} - \theta| < \varepsilon\,]$ and show
that there exists a pair $(\theta_0, \varepsilon_0)$ for which the above probability does not go to one.
You must avoid the trap of wrongly claiming inconsistency of $X_{(1)}$ by showing
that $E(X_{(1)} - \theta)^2 = $ MSE $(X_{(1)})$ does not tend to 0 as $n \to \infty$. Recall that if $Y_n$
converges in quadratic mean to zero then $Y_n \xrightarrow{P} 0$, but we have situations in
which $Y_n \xrightarrow{P} 0$ but $E(Y_n^2)$ does not tend to 0, e.g. take $Y_n = 0$ with probability
$1 - \frac{1}{n}$ and $n$ with probability $1/n$; then $Y_n \xrightarrow{P} 0$ but $E(Y_n^2) \to \infty$.)

6. Let $X_{ij} = \mu_i + \varepsilon_{ij}$, $i = 1, 2, \ldots, k$, $j = 1, 2, \ldots, n$, the model used in balanced
one way analysis of variance to analyse the k sample problem. For $\{\varepsilon_{ij}\}$ i.i.d.
$N(0, \sigma^2)$ we have seen in Example 4.3.1 that $T = (\bar{X}_1, \ldots, \bar{X}_k,\ S^2/(n - 1)k)'$ is
M-optimal for $(\mu_1, \ldots, \mu_k, \sigma^2)' = \theta$, where $S^2 = \sum_{i=1}^{k} S_i^2$ and
$S_i^2 = \sum_{j=1}^{n} (X_{ij} - \bar{X}_i)^2$. Show that the M-optimal estimator is consistent. Further,
even if we only assume that $\{\varepsilon_{ij}\}$ are i.i.d. with $E(\varepsilon_{ij}) = 0$, Var $(\varepsilon_{ij}) = \sigma^2$
without assuming their normality, show that T remains consistent for $\theta$.

7. In the Neyman-Scott problem with $X_{ij} = \mu_i + \varepsilon_{ij}$, $i = 1, 2, \ldots, k$, $j = 1, 2$, where
$\{\varepsilon_{ij}\}$ are i.i.d. $N(0, \sigma^2)$, we can show that $T = (\bar{X}_1, \ldots, \bar{X}_k,\ \sum_{i=1}^{k} S_i^2/k)'$ is
M-optimal for $\theta$. Show that when $k \to \infty$, $\bar{X}_i$ is not consistent for $\mu_i$,
$i = 1, 2, \ldots, k$, although $\sum S_i^2/k$ is consistent for $\sigma^2$, and thus T is not
consistent for $\theta$.


5.2 Method of Moments 

The reader would have noticed from the previous section that using WLLN
in the case where $\{X_i\}_1^n$ are i.i.d. is one of the simplest methods to generate
consistent estimators. Let $(X_1, \ldots, X_n)'$ be a random sample of size n from
$\{f(x, \theta),\ \theta \in \Omega \subset R_1\}$. Then if $E(X_i) = \mu(\theta)$ then $\bar{X} \xrightarrow{P} \mu(\theta)$ and $\bar{X}$ is
consistent for $\mu(\theta)$. If $\mu(\theta)$ is such that $\mu^{-1}$ exists and is continuous then
by Theorem 5.1.1, the property of invariance under continuous transformation,
$\mu^{-1}(\bar{X}) \xrightarrow{P} \mu^{-1}(\mu(\theta)) = \theta$ and $\mu^{-1}(\bar{X})$ is consistent for $\theta$. This is the well
known method of moments where we obtain an estimator by solving the
moment equation $\bar{X} = \mu(\theta)$. A sufficient condition for $\mu^{-1}$ to exist and be
continuous is that $\frac{d\mu}{d\theta} \ne 0$. Note that we could have used transformed
observations $Y_i = U(X_i)$ if $E(Y_i) = \eta(\theta)$ is such that $\frac{d\eta}{d\theta} \ne 0$ to obtain a
consistent estimator based on $\{Y_i\}$ given by $\eta^{-1}(\bar{Y})$.

Example 5.2.1 Let $\{X_i\}$ be i.i.d. exponential with pdf given by
$f(x, \theta) = \theta e^{-\theta x}$, $\theta > 0$, $x > 0$. Here $E(X) = 1/\theta$ and $\frac{d\mu}{d\theta} = -\frac{1}{\theta^2} \ne 0$. Hence the
moment equation based on $\{X_i\}$ gives $\hat{\theta} = \frac{1}{\bar{X}}$ as a consistent estimator
of $\theta$, the failure rate or the reciprocal of the mean life time of the r.v. X.

Example 5.2.2 Consider the Pareto distribution with pdf $f(x, \lambda) = \frac{\lambda}{x^{\lambda+1}}$,
$x > 1$, $\lambda > 0$, which is used as a model for income distribution. As observed
in Example 5.1.4, $E(X)$ does not exist if $0 < \lambda \le 1$ and unless a priori we
know that $\lambda > 1$, the method of moments will fail. However, suppose
$\Omega = (1, \infty)$; then $E(X) = \mu(\lambda) = \frac{\lambda}{\lambda - 1}$ and $\frac{d\mu}{d\lambda} = -\frac{1}{(\lambda - 1)^2} \ne 0$. Therefore the
moment equation $\bar{X} = \frac{\lambda}{\lambda - 1}$ determines the estimator $T = \frac{\bar{X}}{\bar{X} - 1}$ which is
consistent for $\lambda$. However if $\Omega = (0, \infty)$ then the method of moments cannot
be applied using the original observations themselves. Under the
transformation $y = \log x$ we observe that $\{f(x, \lambda),\ \lambda > 0\}$ is a one parameter
exponential family with $k(x) = \log x$, and $Y = \log X$ is exponential with pdf
$g(y, \lambda) = \lambda e^{-\lambda y}$, $\lambda > 0$, $y > 0$. Now by Example 5.2.1,

\[ \frac{n}{\sum Y_i} = \frac{n}{\sum \log X_i} \]

is consistent for $\lambda$.
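The two Pareto estimators of Example 5.2.2 can be compared by simulation. The following is a sketch only, assuming NumPy; samples are drawn by inverting the df $F(x) = 1 - x^{-\lambda}$, and the values $\lambda = 3$ and $\lambda = 0.5$, the sample size, the seed and the helper names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def pareto_sample(lam, n, rng):
    """Draw from f(x, lam) = lam / x^(lam+1), x > 1, by inverting F(x) = 1 - x^(-lam)."""
    u = rng.random(n)                       # uniform on [0, 1)
    return (1.0 - u) ** (-1.0 / lam)

def moment_estimator(x):
    """Solve xbar = lam / (lam - 1); meaningful only when lam > 1."""
    xbar = x.mean()
    return xbar / (xbar - 1.0)

def log_moment_estimator(x):
    """n / sum(log X_i): moment estimator based on Y = log X, valid for all lam > 0."""
    return x.size / np.log(x).sum()

n = 200_000
x_light = pareto_sample(3.0, n, rng)        # E(X) exists
x_heavy = pareto_sample(0.5, n, rng)        # E(X) does not exist

est_light_moment = moment_estimator(x_light)
est_light_log = log_moment_estimator(x_light)
est_heavy_log = log_moment_estimator(x_heavy)

print(f"lam = 3.0: xbar/(xbar-1) = {est_light_moment:.4f}, "
      f"n/sum(log x) = {est_light_log:.4f}")
print(f"lam = 0.5: n/sum(log x) = {est_heavy_log:.4f}, "
      f"xbar = {x_heavy.mean():.3e} (no finite limit)")
```

For $\lambda = 0.5$ the sample mean is dominated by a few enormous observations and changes wildly from run to run, while the estimator based on $\log X$ remains stable.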


Example 5.2.3 Let X have a Weibull distribution with parameter $\alpha$ so that
$f(x, \alpha) = \alpha x^{\alpha - 1} \exp\{-x^\alpha\}$, $x > 0$, $\alpha > 0$. Thus X is such that $X^\alpha$ is standard
exponential with mean one. The Weibull distribution is quite frequently used as
a model for failure time distribution. Here $E(X) = \Gamma\!\left( \frac{1}{\alpha} + 1 \right)$ and the
moment equation is given by $\bar{X} = \Gamma\!\left( \frac{1}{\alpha} + 1 \right) = \mu(\alpha)$. Here
$\mu(\alpha) = \Gamma\!\left( \frac{1}{\alpha} + 1 \right)$ is such that $\mu^{-1}$ does not exist although $\frac{1}{\alpha} + 1$ is a monotone
decreasing function varying over $(\infty, 1)$ as $\alpha$ varies over $(0, \infty)$. Let
$\theta = \frac{1}{\alpha} + 1$; then $\theta \in (1, \infty)$. Now $\Gamma(1) = \Gamma(2) = 1$, and as $\Gamma(\theta)$ is continuous and
differentiable, by Rolle's Theorem there exists a $\theta \in (1, 2)$ such that
$\frac{d\Gamma(\theta)}{d\theta} = 0$. Thus there exists $\alpha \in (1, \infty)$ such that $\frac{d\mu}{d\alpha} = 0$ and $\mu^{-1}$ does
not exist, and therefore the method of moments based on $\bar{X}$ does not work
unless a priori we know that $\alpha \in (0, 1)$. To avoid this problem suppose we
consider the k-th moment of X; then $E(X^k) = \Gamma\!\left( \frac{k}{\alpha} + 1 \right)$ and as $\alpha$ varies
over $(0, \infty)$, whatever be the value of k, $\frac{k}{\alpha} + 1$ will vary over $(1, \infty)$, which
includes the interval $(1, 2)$, and $\mu_k'(\alpha) = \Gamma\!\left( \frac{k}{\alpha} + 1 \right)$ will be such that
$\frac{d\mu_k'}{d\alpha} = 0$ for some $\alpha$ with $\frac{k}{\alpha} + 1 \in (1, 2)$, and the moment equation will fail to
give an estimator. In fact this failure is due to non-uniqueness or
non-existence of a solution to the moment equation $m_k' = \Gamma\!\left( \frac{k}{\alpha} + 1 \right)$. A detailed
study using simulation has been carried out by Paranjape (1994). It is observed
that the minimum value of $\Gamma\!\left( \frac{k}{\alpha} + 1 \right)$ is .8856 and occurs around $\frac{k}{\alpha} = 0.46$.

If the observed $m_k' < .8856$ then the moment equation has no solution, and if
$m_k' \in [.8856, 1]$ then the moment equation would have two roots. Only if the
observed $m_k' > 1$ does the moment equation have a unique solution. To get
around this difficulty consider the mgf of $\log X$, given by

\[ E(e^{t \log X}) = E(X^t) = \Gamma\!\left( \frac{t}{\alpha} + 1 \right). \]

This gives us $E(\log X) = \frac{\Gamma'(1)}{\alpha} = -\frac{\gamma}{\alpha}$, where $\gamma$ is Euler's constant. For
properties of the Gamma function $\Gamma(1 + u)$ and its derivatives we refer to
Abramowitz and Stegun [1968]. Thus $\frac{1}{n}\sum \log X_i \xrightarrow{P} -\frac{\gamma}{\alpha}$ and therefore
$\hat{\alpha} = -n\gamma/\sum \log X_i$ is consistent for $\alpha$. There is still some problem with the
possibility that $\hat{\alpha}$ may be negative if $\sum \log X_i > 0$.

Now $\{\log X_i\}_1^n$ are i.i.d. r.v.s with mean $-\frac{\gamma}{\alpha}$ and variance $\frac{\pi^2}{6\alpha^2}$; therefore
by the CLT as applied to $\{\log X_i\}_1^n$, $\frac{1}{n}\sum \log X_i$ for large n is approximately
normal with mean $-\frac{\gamma}{\alpha}$ and variance $\frac{\pi^2}{6n\alpha^2}$. Hence,

\[ P\left[ \frac{1}{n}\sum \log X_i > 0 \right] \approx 1 - \Phi\!\left( \frac{\gamma\sqrt{6n}}{\pi} \right) \to 0 \quad \text{as } n \to \infty. \]

Thus for large samples there will be in general no problem in obtaining the
moment estimator based on $\log X$.
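The numerical facts quoted in Example 5.2.3 and the estimator based on $\log X$ can be checked with a short sketch. It assumes NumPy, whose `Generator.weibull(a)` draws from the pdf $a x^{a-1} e^{-x^a}$ used here; the grid, the value $\alpha = 2$, the seed and the helper names are illustrative assumptions.

```python
import math
import numpy as np
from statistics import NormalDist

# Minimum of Gamma(t + 1) over t = k/alpha > 0, located on a fine grid.
t_grid = np.linspace(0.01, 2.0, 2000)
g = np.array([math.gamma(1.0 + t) for t in t_grid])
t_min, g_min = t_grid[g.argmin()], g.min()
print(f"min Gamma(t + 1) = {g_min:.4f} at t = {t_min:.3f}")

def log_moment_estimator(x):
    """alpha-hat = -n * gamma / sum(log X_i), from E(log X) = -gamma / alpha."""
    return -np.euler_gamma * x.size / np.log(x).sum()

def prob_negative_estimate(n):
    """CLT approximation 1 - Phi(gamma * sqrt(6 n) / pi) to P[sum(log X_i) > 0]."""
    return 1.0 - NormalDist().cdf(np.euler_gamma * math.sqrt(6 * n) / math.pi)

rng = np.random.default_rng(3)
alpha = 2.0                                  # illustrative true value
x = rng.weibull(alpha, size=200_000)
alpha_hat = log_moment_estimator(x)
print(f"alpha = {alpha}, alpha-hat = {alpha_hat:.4f}")

for n in (5, 25, 100):
    print(f"n = {n:>3}: approx P[alpha-hat < 0] = {prob_negative_estimate(n):.2e}")
```

Note that the approximate probability of a negative estimate does not depend on $\alpha$, and it is already negligible for moderate $n$.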

In case the parameter $\theta = (\theta_1, \ldots, \theta_m)'$ is vector valued and
$\{f(x, \theta),\ \theta \in \Omega \subset R^m\}$ is such that there exist m functions
$(T_1(X_i), \ldots, T_m(X_i))$ such that $E(T_r(X_i)) = \psi_r(\theta)$, $r = 1, 2, \ldots, m$, then by
WLLN, $\frac{1}{n}\sum_{i=1}^{n} T_r(x_i) = \bar{T}_r$ is consistent for $\psi_r(\theta)$. If $\{\psi_1, \ldots, \psi_m\}$ are such
that the Jacobian

\[ \frac{\partial(\psi_1, \ldots, \psi_m)}{\partial(\theta_1, \ldots, \theta_m)} \ne 0 \]

then the moment equations given by

\[ \bar{T}_r = \psi_r(\theta), \quad r = 1, 2, \ldots, m \]

determine the estimator $(\hat{\theta}_1, \ldots, \hat{\theta}_m)'$ which is consistent for $\theta$, since the
implicit function theorem is applicable to the above system of equations.
Indeed originally the method of moments was proposed by Pearson to
determine the parameters occurring in distributions whose pdf is determined
by the differential equation

\[ \frac{d \log f(x)}{dx} = \frac{(x - a)}{b_0 + b_1 x + b_2 x^2}, \quad x \in (c, d). \]

K. Pearson showed that under some regularity conditions which guarantee the
existence of the first four moments the constants (a, b, c, d) are determined
uniquely by $(\mu_1', \mu_2', \mu_3', \mu_4')$ or equivalently by $(\mu_1', \mu_2, \beta_1, \beta_2)$, i.e. mean,
variance, skewness and kurtosis. K. Pearson also prepared tables of $(\beta_1, \beta_2)$
using which we first guess the type of the curve and then estimate (a, b,
c, d). Thus for real valued data $(x_1, \ldots, x_n)'$, the Pearson technique consisted
of (i) calculating $(m_1', m_2', m_3', m_4')$ and then $(m_1', m_2, b_1, b_2)$, and (ii) using
the sample skewness $b_1$ and sample kurtosis $b_2$ and the Biometrika tables
(first prepared by Pearson) to guess the type of the pdf and then estimate
(a, b, c, d). This was a standard procedure used for a long time. Our
formulation allows us to select appropriate functions $(T_1, \ldots, T_m)$ rather
than $(X, X^2, \ldots, X^m)$.

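Example 5.2.4 Consider Exercise 5.1 (1). Here X is Gamma with pdf
$f(x, \sigma, \lambda) = \frac{e^{-x/\sigma} x^{\lambda - 1}}{\Gamma(\lambda)\sigma^\lambda}$, $x > 0$, $\sigma > 0$, $\lambda > 0$. Now $\mu_1' = \lambda\sigma$ and
$\mu_2' = \lambda(\lambda + 1)\sigma^2$ so that

\[ \frac{\partial(\mu_1', \mu_2')}{\partial(\lambda, \sigma)} = \begin{vmatrix} \sigma & \lambda \\ \sigma^2(2\lambda + 1) & 2\sigma\lambda(\lambda + 1) \end{vmatrix} = \sigma^2\lambda > 0 \]

and the moment equations $m_1' = \lambda\sigma$, $m_2' = \lambda(\lambda + 1)\sigma^2$ determine $(\lambda, \sigma)$
uniquely; they are given by

\[ \hat{\sigma} = \frac{m_2' - m_1'^2}{m_1'}, \qquad \hat{\lambda} = \frac{m_1'^2}{m_2' - m_1'^2}, \]

which can be shown to be consistent for $\sigma$ and $\lambda$ respectively. However, if we
do not know that X is Gamma distributed then fitting of a frequency curve can
be done by using the data to calculate $(b_1, b_2)'$ and using these values to guess
the type, say Pearson Type III, which is the Gamma distribution.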
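The moment estimators of Example 5.2.4 and the Jacobian condition can be checked numerically. This is a minimal sketch assuming NumPy, with illustrative true values $\lambda = 2.5$, $\sigma = 1.5$ and an arbitrary seed.

```python
import numpy as np

rng = np.random.default_rng(4)
lam, sigma = 2.5, 1.5                       # illustrative shape and scale
x = rng.gamma(shape=lam, scale=sigma, size=200_000)

# Sample raw moments m_1' and m_2'.
m1 = x.mean()
m2_raw = (x**2).mean()

# Solutions of m_1' = lam*sigma, m_2' = lam*(lam+1)*sigma^2.
sigma_hat = (m2_raw - m1**2) / m1
lam_hat = m1**2 / (m2_raw - m1**2)

# Jacobian of (mu_1', mu_2') with respect to (lam, sigma); it equals sigma^2 * lam.
jacobian = np.linalg.det(np.array([
    [sigma, lam],
    [sigma**2 * (2 * lam + 1), 2 * sigma * lam * (lam + 1)],
]))

print(f"lam-hat = {lam_hat:.4f}, sigma-hat = {sigma_hat:.4f}")
print(f"Jacobian = {jacobian:.4f}, sigma^2 * lam = {sigma**2 * lam:.4f}")
```

Because the Jacobian is positive everywhere on the parameter space, the map from $(\lambda, \sigma)$ to the first two raw moments is locally invertible, which is what the implicit function theorem argument requires.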

Pearson's method of moments was used by Student (1908) to obtain the
sampling distribution of $S^2 = \sum (X_i - \bar{X})^2$ and that of
$t_{n-1} = \bar{X}\sqrt{n}/\sqrt{S^2/(n - 1)}$, known as Student's t with $(n - 1)$ degrees of
freedom. Assuming that $\{X_i\}$ are i.i.d. $N(0, 1)$, Student obtained the first four
moments of $S^2$ and $t_{n-1}$ and their $(b_1, b_2)$, guessed the Pearson types and then
obtained their exact sampling distributions. The usual derivation of the
distribution of $S^2$ and $t_{n-1}$ was given later by Fisher (1922). Student's 1908
paper is regarded as a major breakthrough in Statistics as it introduced the
concept of exact sampling distribution theory. It is interesting to point out
that Student obtained the moments of $t_{n-1}$ by claiming independence of $\bar{X}$ and
$S^2$ on the basis of Cov $(\bar{X}, S^2) = 0$. Of course Cov $(T_1, T_2) = 0$ does not imply
that $T_1$, $T_2$ are independent. In Student's case the conclusion was true (as
later verified by Fisher) but the argument was certainly incorrect.

Exercise 5.2 (1) Consider the $b(1, p)$ model where $p = P[X = 1]$ is a function of $\theta$, a
situation that occurs in the bio-assay problem. Suppose that at the dosage level $d > 0$,
$P[X = 1] = p(\theta) = \frac{e^{\theta d}}{1 + e^{\theta d}}$, $\theta \in R_1$; then obtain the moment estimator of $\theta$ based on $\bar{X}$.

(2) Show that the moment estimator of $\theta$ based on $\bar{X}$ in the following three models
is the same. Compare the variances of the moment estimator under these three models.

(a) $X \sim N(\theta, 1)$, (b) $X \sim U(\theta - 1, \theta + 1)$, (c) X with pdf $f(x, \theta) = \frac{1}{2} \exp\{-|x - \theta|\}$,
$x \in R_1$, $\theta \in R_1$ (Double Exponential Distribution).

(3) Suppose that a priori it is known that $X \sim N(\theta, 1)$ with $\Omega_0 = \{\theta \mid -1 \le \theta \le 1\}$.
Obtain the probability $p_\theta$ that the moment equation $\bar{x} = \theta$ would not have a solution in
$\Omega_0$. Show that $p_\theta \to 0$ as $n \to \infty$ for $\theta \in (-1, 1)$ but at $\theta = \pm 1$, $p_\theta \to 1/2$.

(4) For the truncated Poisson distribution with pmf

\[ P[X = r] = \frac{e^{-\lambda}\lambda^r}{r!\,(1 - e^{-\lambda})}, \quad r = 1, 2, 3, \ldots \]

show that the moment equation $m_1' = E(X)$ has a unique solution.

5.3 Method of Percentiles 

Suppose $(X_1, \ldots, X_n)$ is a random sample of size n on a continuous r.v. X
from pdf $\{f(x, \theta),\ \theta \in \Omega \subset R_1\}$; then the 100p% population percentile $\xi_p(\theta)$
is defined as the solution of the equation

\[ F(\xi_p(\theta), \theta) = \int_{-\infty}^{\xi_p(\theta)} f(x, \theta)\,dx = p \]

where $0 < p < 1$, or $\xi_p(\theta) = F_\theta^{-1}(p)$. Note that if the r.v. X is assumed to have
a non-vanishing pdf at $\xi_p(\theta)$ then $\xi_p(\theta)$ is uniquely determined. Let
$r = [np] + 1$; then the r-th order statistic of the sample, $X_{(r)}$, can be shown to be
a consistent estimator of $\xi_p(\theta)$, or $X_{(r)} \xrightarrow{P} \xi_p(\theta)$, $\forall\, \theta \in \Omega$. The method of
percentiles consists in obtaining an estimator of $\theta$ by solving the "percentile
equation"

\[ X_{(r)} = \xi_p(\theta) \]


which admits a unique solution 





We note that X (r) can be interpreted as 100p% percentile of the sample. 
Thus as the method of moments equates sample moments with corres¬ 
ponding population moments, the method of percentiles can be viewed as 
a method where we equate sample percentiles with corresponding population 
percentiles. 

To see this we note that if $\{X_{(1)}, \ldots, X_{(n)}\}$ is the order statistic (o.s.) of a sample of size $n$ on a continuous r.v. $X$ with pdf $f$ and df $F$ then $U_{(r)} = F(X_{(r)})$ has the same distribution as that of the $r$-th order statistic for a sample of size $n$ from $U(0, 1)$. Hence

$$E(U_{(r)}) = \frac{r}{n+1}, \qquad E(U_{(r)}^2) = \frac{r(r+1)}{(n+1)(n+2)}.$$

Therefore,

$$E(U_{(r)} - p)^2 = \frac{r(r+1)}{(n+1)(n+2)} - \frac{2pr}{n+1} + p^2.$$

But as $r = [np] + 1$, $np < r \le np + 1$ and therefore $r/n \to p$ as $n \to \infty$. Thus $E(U_{(r)} - p)^2 \to 0$ as $n \to \infty$ and therefore by Tchebychev inequality $U_{(r)} \xrightarrow{P} p$ as $n \to \infty$, or $F(X_{(r)}) \xrightarrow{P} p$. However as $F^{-1}$ exists and is continuous, since $dF/dx = f(x) > 0$ at $x_p = \xi_p(\theta)$, we have by the invariance property $F^{-1}(F(X_{(r)})) = X_{(r)} \xrightarrow{P} F^{-1}(p) = \xi_p(\theta)$. We note that for $p = 1/4$, $3/4$ we get the lower and upper quartiles and for $p = 1/2$ we get the median.
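The definition of the sample percentile $X_{(r)}$ with $r = [np] + 1$ can be sketched in a few lines of Python. The function name `sample_percentile` and the simulated normal data are illustrative assumptions, not part of the text.

```python
import random

def sample_percentile(xs, p):
    """Return X_(r) with r = [np] + 1, the 100p% percentile of the sample."""
    n = len(xs)
    r = int(n * p) + 1              # r = [np] + 1 (1-indexed order statistic)
    return sorted(xs)[r - 1]        # convert to 0-indexed position

# Illustration only: the sample median of N(0, 1) data settles near 0 as n grows.
rng = random.Random(0)
for n in (10, 1_000, 100_000):
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    print(n, sample_percentile(xs, 0.5))
```

For $p = 1/2$ the function returns $X_{([n/2]+1)}$, which is the sample median $M_n$ used in the examples that follow.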

Example 5.3.1 Consider $(X_1, \ldots, X_n)$ a random sample of size $n$ from $N(\theta, 1)$; then for $p = 1/2$ we get $X_{([n/2]+1)} = M_n$, the sample median, as a consistent estimator of $\theta$, the population median. In fact the same result will hold for a random sample of size $n$ from $U(\theta - 1, \theta + 1)$ or Double Exponential (Laplace distribution) with pdf $f(x, \theta) = \frac{1}{2} \exp\{-|x - \theta|\}$, $x \in R_1$, $\theta \in R_1$. As seen in Exercise 5.2.2, in all these three cases $\bar X$, the sample mean, is also a consistent estimator of $\theta$. On the other hand in Example 5.1.4 we have seen that for the Cauchy distribution with pdf

$$f(x, \theta) = \frac{1}{\pi} \cdot \frac{1}{1 + (x - \theta)^2}, \quad x \in R_1,$$

$\bar X$ is not consistent for $\theta$. However, noting that $\theta$ is the population median in case of the Cauchy distribution, the sample median $M_n$ would be consistent for $\theta$.
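The contrast between $\bar X$ and $M_n$ for the Cauchy distribution can be seen in a small simulation. This is only a sketch: the helper names `cauchy_sample` and `sample_median`, the seed and the sample size are assumptions chosen for illustration.

```python
import math
import random

def cauchy_sample(theta, n, rng):
    """Draw n Cauchy(theta) values by inverting the d.f.: theta + tan(pi (U - 1/2))."""
    return [theta + math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]

def sample_median(xs):
    """M_n = X_([n/2]+1)."""
    return sorted(xs)[len(xs) // 2]

rng = random.Random(1)
xs = cauchy_sample(theta=3.0, n=20_001, rng=rng)
print("sample mean  :", sum(xs) / len(xs))   # erratic, not consistent for theta
print("sample median:", sample_median(xs))   # close to theta = 3
```

Re-running with different seeds typically shows the mean jumping around while the median stays near $\theta$.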

Example 5.3.2 We have seen in Example 5.2.2 that in case of the Pareto distribution, if $\lambda \in (0, 1)$, the method of moments fails. However here $\xi_p(\lambda) = \exp\{-\log(1 - p)/\lambda\}$ and $X_{([np]+1)} \xrightarrow{P} \exp\{-\log(1 - p)/\lambda\}$. As $d\xi_p/d\lambda \ne 0$, $\xi_p$ is a continuous invertible function of $\lambda$ and the percentile equation is given by

$$X_{([np]+1)} = \exp\{-\log(1 - p)/\lambda\}$$

which yields a consistent estimator of $\lambda$ given by

$$\hat\lambda = [-\log(1 - p)] / \log X_{([np]+1)}.$$

Note that the percentile method does not put any restriction on $\lambda$ such as $\lambda \in (1, \infty)$ as is required for the method of moments.
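A minimal sketch of the percentile estimator for the Pareto case, assuming the d.f. $F(x) = 1 - x^{-\lambda}$, $x > 1$, which is what the stated percentile $\exp\{-\log(1-p)/\lambda\}$ implies. The function names and the simulation settings are hypothetical.

```python
import math
import random

def pareto_sample(lam, n, rng):
    """Inverse-d.f. sampling: X = U^(-1/lam) with U uniform on (0, 1]."""
    return [(1.0 - rng.random()) ** (-1.0 / lam) for _ in range(n)]

def pareto_percentile_estimate(xs, p):
    """lambda_hat = -log(1 - p) / log X_([np]+1)."""
    n = len(xs)
    x_r = sorted(xs)[int(n * p)]        # X_([np]+1), 0-indexed position [np]
    return -math.log(1.0 - p) / math.log(x_r)

rng = random.Random(2)
xs = pareto_sample(lam=0.5, n=20_000, rng=rng)   # lambda in (0, 1): no finite mean
print(pareto_percentile_estimate(xs, p=0.5))
```

The example deliberately uses $\lambda = 0.5$, a value for which the moment equation is unavailable.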

Recall that if $X$ is Pareto with parameter $\lambda$ then $Y = \log X$ is exponential with mean $1/\lambda$, or $g(y, \lambda) = \lambda e^{-\lambda y}$, $y > 0$. Now the 100p% percentile of $Y$ is $\xi_p(\lambda) = -\log(1 - p)/\lambda$. The percentile equation based on $Y$ is $\log X_{([np]+1)} = -\log(1 - p)/\lambda$, which leads to the same estimator $\hat\lambda$ given above. On the other hand the method of moments as applied to $Y$ gives the estimator $\hat\lambda_0 = 1/\overline{\log X}$. Observe that $y = \log x$ is a strictly increasing transformation and therefore $(\log X_{(1)}, \ldots, \log X_{(n)})$ is the o.s. of a sample of size $n$ from $g(y, \lambda)$ with $Y_{(i)} = \log X_{(i)}$.


Example 5.3.3 We have seen in Example 5.2.3 that the method of moments has various problems when $X$ is Weibull with parameter $\alpha$. Now observing that $X^\alpha$ is exponential with mean one, we have the 100p% percentile $\xi_p(\alpha) = [-\log(1 - p)]^{1/\alpha}$, and the percentile method yields a consistent estimator of $\alpha$ given by

$$\hat\alpha = \log[-\log(1 - p)] / \log X_{([np]+1)}.$$

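Consider in general a location scale parameter family with pdf $f(x, \mu, \sigma) = \frac{1}{\sigma} f_0\!\left(\frac{x - \mu}{\sigma}\right)$, $x \in R_1$, $\mu \in R_1$, $\sigma > 0$, where $f_0(x) = f(x, 0, 1)$ is the standard pdf. Let $0 < p < 1$. Then $\xi_p = \mu + c_p \sigma$ where $c_p$ is the 100p% percentile of the standard pdf given by $p = \int_{-\infty}^{c_p} f_0(x)\,dx = F_0(c_p)$. Consider $0 < p_1 < p_2 < \ldots < p_k < 1$. Then the corresponding order statistic $X_{([np_r]+1)}$ is consistent for $\xi_{p_r}$, and we have $k$ equations to determine estimators of $\mu$ and $\sigma$. These are given by

$$X_{([np_r]+1)} = \mu + c_{p_r} \sigma, \quad r = 1, 2, \ldots, k.$$

We may obtain estimators of $(\mu, \sigma)'$ by the method of least squares, which are given by

$$\hat\sigma = \frac{k \sum_{r=1}^{k} c_{p_r} X_{([np_r]+1)} - \left(\sum_{r=1}^{k} c_{p_r}\right)\left(\sum_{r=1}^{k} X_{([np_r]+1)}\right)}{k \sum_{r=1}^{k} c_{p_r}^2 - \left(\sum_{r=1}^{k} c_{p_r}\right)^2}, \qquad \hat\mu = \frac{1}{k}\sum_{r=1}^{k} X_{([np_r]+1)} - \hat\sigma\, \frac{1}{k} \sum_{r=1}^{k} c_{p_r}.$$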

Noting that $X_{([np_r]+1)} \xrightarrow{P} \mu + c_{p_r} \sigma$ as $n \to \infty$ one can easily demonstrate that $\hat\sigma \xrightarrow{P} \sigma$ and $\hat\mu \xrightarrow{P} \mu$ so that $(\hat\mu, \hat\sigma)'$ is consistent for $(\mu, \sigma)'$. We emphasize here the fact that $k$ is a fixed integer and the estimators $\hat\mu$ and $\hat\sigma$ are based on a fixed number $k$ of selected percentiles.
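The least squares fit of the $k$ percentile equations $X_{([np_r]+1)} = \mu + c_{p_r}\sigma$ is an ordinary simple linear regression of the selected order statistics on the standard percentiles. The sketch below assumes a normal family, so that $c_p$ comes from the standard normal d.f.; the function name `percentile_least_squares` and the chosen $p_r$ are illustrative assumptions.

```python
import random
from statistics import NormalDist

def percentile_least_squares(xs, ps, std_quantile=NormalDist().inv_cdf):
    """Least squares solution of X_([n p_r]+1) = mu + c_{p_r} sigma, r = 1..k."""
    n, k = len(xs), len(ps)
    s = sorted(xs)
    c = [std_quantile(p) for p in ps]           # c_{p_r}: standard percentiles
    q = [s[int(n * p)] for p in ps]             # X_([n p_r]+1), 0-indexed
    sc, sq = sum(c), sum(q)
    scc = sum(ci * ci for ci in c)
    scq = sum(ci * qi for ci, qi in zip(c, q))
    sigma_hat = (k * scq - sc * sq) / (k * scc - sc * sc)
    mu_hat = (sq - sigma_hat * sc) / k
    return mu_hat, sigma_hat

rng = random.Random(3)
xs = [rng.gauss(3.0, 2.0) for _ in range(20_000)]
print(percentile_least_squares(xs, ps=(0.1, 0.25, 0.5, 0.75, 0.9)))
```

With $k$ fixed and $n$ large, both components settle near the true $(\mu, \sigma) = (3, 2)$ used to generate the data.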
On the other hand, in the range depending on parameter case such as $U(0, \theta)$ considered in Exercise 5.1(5), we have shown that $X_{(n)}$ is consistent for $\theta$, the upper end of the distribution. Similarly we can show that $X_{(1)}$, the first order statistic of the sample from pdf $f(x, \mu) = \exp\{-(x - \mu)\}$, $x \ge \mu$, is consistent for $\mu$. This follows from the fact that $E(X_{(1)}) = \mu + 1/n$ and $\mathrm{Var}(X_{(1)}) = 1/n^2$ so that $E(X_{(1)} - \mu)^2 \to 0$. Indeed if $(X_{(1)}, \ldots, X_{(n)})$ is the order statistic from a continuous distribution with range $(a, b)$ then one can show that $X_{(1)} \xrightarrow{P} a$ and $X_{(n)} \xrightarrow{P} b$. If the random variable $X$ has range $(-\infty, b)$ then $X_{(n)} \xrightarrow{P} b$ but $X_{(1)}$ will diverge to minus infinity, and if the range is $(a, \infty)$ then $X_{(1)} \xrightarrow{P} a$ but $X_{(n)}$ will diverge to plus infinity. The proofs are an immediate consequence of the fact that the distribution function of $X_{(n)}$ is given by $[F(u)]^n$ and that of $X_{(1)}$ is given by $1 - (1 - F(u))^n$. We prove this for $X_{(n)}$ and leave the results for $X_{(1)}$ as an exercise to the reader. Now $G_n(u)$, the d.f. of $X_{(n)}$ when the range of $X$ is a bounded interval $(a, b)$, is given by

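$$G_n(u) = \begin{cases} 0, & u \le a \\ [F(u)]^n, & a < u < b \\ 1, & u \ge b. \end{cases}$$

Taking the limit as $n \to \infty$,

$$\lim_{n \to \infty} G_n(u) = \begin{cases} 0, & u < b \\ 1, & u \ge b \end{cases}$$

which is the d.f. of the r.v. degenerate at $b$. Let $\varepsilon > 0$; then

$$P[b - \varepsilon < X_{(n)} < b + \varepsilon] = G_n(b + \varepsilon) - G_n(b - \varepsilon) = 1 - G_n(b - \varepsilon).$$

But $G_n(b - \varepsilon) \to 0$ for any $\varepsilon > 0$ and therefore we have $P[|X_{(n)} - b| < \varepsilon] \to 1$ and $X_{(n)} \xrightarrow{P} b$.

On the other hand if the range of $X$ is $(a, \infty)$ then $G_n(u) = 0$ if $u \le a$ and $G_n(u) = [F(u)]^n$, $a < u$, and $G(u) = \lim G_n(u) = 0$ for any $u \in R_1$, so that $P[X_{(n)} \le u] \to 0$ as $n \to \infty$ for any $u \in R_1$. This implies that $P[X_{(n)} > u] \to 1$ as $n \to \infty$ for any $u \in R_1$, or $X_{(n)}$ diverges to plus infinity.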

We now consider an example by which the distinction between the 
behaviour of extreme order statistics and that of central order statistics will 
be clarified further. 



Example 5.3.4 Let $(X_{(1)}, \ldots, X_{(2n+1)})$ be the order statistic of a random sample of size $(2n + 1)$ from $U(\theta - 1, \theta + 1)$ with pdf $f(x, \theta) = \frac{1}{2}$, $\theta - 1 < x < \theta + 1$. From what we have seen earlier $X_{(1)} \xrightarrow{P} \theta - 1$ and $X_{(2n+1)} \xrightarrow{P} \theta + 1$, so that $T_1 = X_{(1)} + 1$ and $T_2 = X_{(2n+1)} - 1$ are both consistent for $\theta$. Further the sample median $X_{(n+1)}$ is consistent for $\theta$, the population median. Routine calculations will show that the pdf of $X_{(2n+1)}$ is given by

$$g(x, \theta) = \frac{(2n + 1)(x - \theta + 1)^{2n}}{2^{2n+1}}, \quad \theta - 1 < x < \theta + 1.$$

Let $W = \dfrac{X_{(2n+1)} - \theta + 1}{2}$; then the pdf of $W$ is given by

$$g(w) = (2n + 1) w^{2n}, \quad 0 < w < 1.$$

This gives $E(W) = \dfrac{2n+1}{2n+2}$ and $E(W^2) = \dfrac{2n+1}{2n+3}$, and

$$\mathrm{Var}(W) = \frac{2n+1}{(2n+2)^2 (2n+3)}.$$

Now

$$E(T_2) = E(X_{(2n+1)} - 1) = 2E(W) + \theta - 2 = \theta - \frac{1}{n+1}, \qquad \mathrm{Var}(T_2) = 4\,\mathrm{Var}(W) = \frac{4(2n+1)}{(2n+2)^2 (2n+3)}.$$

Hence

$$E(T_2 - \theta)^2 = \frac{4(2n+1)}{(2n+2)^2 (2n+3)} + \frac{1}{(n+1)^2} = \frac{4}{(n+1)(2n+3)}.$$

Therefore $T_2$ converges in quadratic mean to $\theta$ and hence $T_2 \xrightarrow{P} \theta$. The distribution of $T_1$ is almost identical and the reader can verify that $T_1 \xrightarrow{P} \theta$; in fact $E(T_1 - \theta)^2 = E(T_2 - \theta)^2$. Now consider the median $M_n = X_{(n+1)}$; then its pdf is given by

$$g(x_{(n+1)}, \theta) = \frac{1}{2B(n+1, n+1)} \left(\frac{x_{(n+1)} - \theta + 1}{2}\right)^{n} \left(1 - \frac{x_{(n+1)} - \theta + 1}{2}\right)^{n}, \quad \theta - 1 < x_{(n+1)} < \theta + 1,$$

or if we use $W = \dfrac{X_{(n+1)} - \theta + 1}{2}$ then

$$g(w) = \frac{1}{B(n+1, n+1)}\, w^n (1 - w)^n, \quad 0 < w < 1.$$

This gives $E(W) = \frac{1}{2}$ so that $E(X_{(n+1)}) = \theta$, and

$$E(W^2) = \frac{B(n+3, n+1)}{B(n+1, n+1)} = \frac{n+2}{2(2n+3)}, \qquad \mathrm{Var}(W) = \frac{1}{4(2n+3)},$$

so that $\mathrm{Var}(X_{(n+1)}) = 4\,\mathrm{Var}(W) = \dfrac{1}{2n+3}$. Thus $E(X_{(n+1)} - \theta)^2 = \dfrac{1}{2n+3} \to 0$ as $n \to \infty$, and $X_{(n+1)}$ is also consistent for $\theta$. We however observe that MSE$(T_1)$ = MSE$(T_2)$ tends to zero at the rate of $1/n^2$ but MSE$(X_{(n+1)})$ tends to zero at the rate of $1/n$.
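The two rates can be checked with exact arithmetic. The sketch below simply encodes the closed forms $4/((n+1)(2n+3))$ for the extremes and $1/(2n+3)$ for the median, as derived for a sample of size $2n+1$ from $U(\theta - 1, \theta + 1)$; the function names are illustrative.

```python
from fractions import Fraction

def mse_extreme(n):
    """MSE of T1 = X_(1) + 1 (equivalently T2 = X_(2n+1) - 1), sample size 2n + 1."""
    return Fraction(4, (n + 1) * (2 * n + 3))

def mse_median(n):
    """MSE of the sample median X_(n+1), sample size 2n + 1."""
    return Fraction(1, 2 * n + 3)

for n in (10, 100, 1000):
    # n^2 * MSE(T1) approaches 2 while n * MSE(median) approaches 1/2.
    print(n, float(n * n * mse_extreme(n)), float(n * mse_median(n)))
```

The scaled quantities stabilise, which is exactly what "rate $1/n^2$" and "rate $1/n$" mean here.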


5.4 Choosing Between Consistent Estimators 

To choose between two unbiased estimators we have used the criterion of smallness of variance. In a similar way, assuming that the mean squared error, i.e. $E(T - \theta)^2$, exists, we use the criterion of smallness of mean squared error (MSE) to choose between two consistent estimators. Thus if $T_1$ and $T_2$ are both consistent for $\theta$ then we would prefer $T_1$ to $T_2$ if MSE$(T_1) \le$ MSE$(T_2)$, $\forall\, \theta \in \Omega$, and there exists an $n_1$ such that $\forall\, n \ge n_1$, MSE$(T_1) <$ MSE$(T_2)$, $\forall\, \theta \in \Omega$. Thus if $T_1$ is preferred to $T_2$ then by Tchebychev inequality it follows that $P[|T_1 - \theta| < \varepsilon]$ converges to unity faster than $P[|T_2 - \theta| < \varepsilon] \to 1$ as $n \to \infty$, $\forall\, \theta \in \Omega$ and any $\varepsilon > 0$. Therefore the minimum sample size (see Section 5.1) to achieve the level of accuracy specified by $(\varepsilon, \delta)$ for $T_1$ would be smaller than that required for $T_2$ for every combination of values of $\varepsilon$, $\delta$, $\theta$. We consider a few examples to illustrate this approach based on the rate at which MSE$(T_i) \to 0$ as $n \to \infty$.

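Example 5.4.1 Let $\{X_i\}_1^n$ be i.i.d. $N(\theta, \sigma^2)$; then as seen in Example 5.1.2, $S^2/n$ and $S^2/(n - 1)$ are both consistent for $\sigma^2$ where $S^2 = \sum (X_i - \bar X)^2 \sim \sigma^2 \chi^2_{n-1}$. Now

$$\mathrm{MSE}\!\left(\frac{S^2}{n}\right) - \mathrm{MSE}\!\left(\frac{S^2}{n-1}\right) = \frac{\sigma^4 (2n - 1)}{n^2} - \frac{2\sigma^4}{n - 1} = \frac{\sigma^4 (1 - 3n)}{n^2 (n - 1)} < 0 \quad \text{for any } n \ge 2.$$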


Therefore we prefer $S^2/n$ to $S^2/(n-1)$. Consider in general $T_1 = a_n \dfrac{S^2}{n-1}$ where $a_n > 0$ is such that $a_n \to 1$ as $n \to \infty$. Then $T_1$ is consistent for $\sigma^2$, as

$$\mathrm{MSE}(T_1) = \mathrm{Var}\!\left(a_n \frac{S^2}{n-1}\right) + (\text{bias})^2 = \frac{2\sigma^4 a_n^2}{n - 1} + (a_n - 1)^2 \sigma^4 \to 0.$$

The reader may verify that for $a_n = \dfrac{n-1}{n}$ we get $T_1 = S^2/n$, and for $a_n = \dfrac{n-1}{n+1}$ we get $T_1 = S^2/(n + 1)$ with MSE $= 2\sigma^4/(n + 1)$, and we have

$$\mathrm{MSE}\!\left(\frac{S^2}{n+1}\right) < \mathrm{MSE}\!\left(\frac{S^2}{n}\right) < \mathrm{MSE}\!\left(\frac{S^2}{n-1}\right).$$


One can show that within the class of consistent estimators given by $\dfrac{S^2}{n + k}$, for $k$ a fixed real constant, MSE$\left(\dfrac{S^2}{n + k}\right)$ is minimum when $k = 1$. We leave this as an exercise to the reader.

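Yet another way of comparing MSE$(T_1)$ and MSE$(T_2)$ is to expand the expressions in powers of $1/n$ and compare the coefficients. For example consider

$$\mathrm{MSE}\!\left(\frac{S^2}{n-1}\right) = \frac{2\sigma^4}{n - 1} = \frac{2\sigma^4}{n}\left(1 - \frac{1}{n}\right)^{-1} = \frac{2\sigma^4}{n}\left[1 + \frac{1}{n} + \frac{1}{n^2} + \ldots\right]$$

$$\mathrm{MSE}\!\left(\frac{S^2}{n}\right) = \frac{\sigma^4 (2n - 1)}{n^2} = \sigma^4 \left(\frac{2}{n} - \frac{1}{n^2}\right)$$

$$\mathrm{MSE}\!\left(\frac{S^2}{n+1}\right) = \frac{2\sigma^4}{n + 1} = \frac{2\sigma^4}{n}\left(1 + \frac{1}{n}\right)^{-1} = \frac{2\sigma^4}{n}\left[1 - \frac{1}{n} + \frac{1}{n^2} - \ldots\right]$$

We see that the coefficient of $\dfrac{1}{n}$ is $2\sigma^4$ in all the three expansions but the coefficients of $\dfrac{1}{n^2}$ in the three cases are $+2$, $-1$ and $-2$, respectively, with a common factor $\sigma^4$, and therefore MSE$\left(\dfrac{S^2}{n+1}\right) \to 0$ at a rate faster than that of the other two. Note that to define an estimator based on $S^2$ we need $n \ge 2$.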
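The comparison of the three divisors, and the claim that $k = 1$ is best within the class $S^2/(n+k)$, can be checked exactly using $\mathrm{MSE}(cS^2)/\sigma^4 = 2(n-1)c^2 + ((n-1)c - 1)^2$, which follows from $S^2 \sim \sigma^2\chi^2_{n-1}$. The function name `mse_scaled` is an assumption made for this sketch, and only integer $k$ is scanned.

```python
from fractions import Fraction

def mse_scaled(n, k):
    """MSE of S^2/(n + k) in units of sigma^4, using S^2 ~ sigma^2 chi^2_{n-1}."""
    c = Fraction(1, n + k)
    variance = 2 * (n - 1) * c * c          # Var(c S^2) / sigma^4
    bias = (n - 1) * c - 1                  # (E(c S^2) - sigma^2) / sigma^2
    return variance + bias * bias

n = 10
for k in (-1, 0, 1, 2):
    print(k, mse_scaled(n, k))
# k = -1 gives 2/(n-1), k = 0 gives (2n-1)/n^2, k = 1 gives 2/(n+1).
```

Scanning a range of $k$ shows the minimum at $k = 1$, in line with the exercise left to the reader.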


Example 5.4.2 Consider a random sample of size $m$ from $U(\theta - 1, \theta + 1)$, a situation considered in Example 5.3.4 for $m = 2n + 1$, an odd number. Now $E(X_i) = \theta$, $\mathrm{Var}(X_i) = \frac{1}{3}$ so that the sample mean $\bar X_m$ is consistent for $\theta$ with MSE$(\bar X_m) = 1/3m = \mathrm{Var}(\bar X_m)$. Let $(X_{(1)}, \ldots, X_{(m)})$ denote the order statistic of the sample. Let $Y_{(i)} = \dfrac{X_{(i)} - \theta + 1}{2}$; then $(Y_{(1)}, \ldots, Y_{(m)})$ is the order statistic of a sample of size $m$ from $U(0, 1)$ and the pdfs of $Y_{(1)}$ and $Y_{(m)}$ are

$$g_1(u) = m(1 - u)^{m-1}, \quad 0 < u < 1$$

$$g_m(v) = m v^{m-1}, \quad 0 < v < 1$$

and the joint pdf of $Y_{(1)}$, $Y_{(m)}$ is

$$g(u, v) = m(m - 1)(v - u)^{m-2}, \quad 0 < u < v < 1.$$

Now routine calculations (which we request the reader to carry out) will show that

$$E(Y_{(1)}) = \frac{1}{m+1}, \qquad E(Y_{(m)}) = \frac{m}{m+1},$$

$$\mathrm{Var}(Y_{(1)}) = \mathrm{Var}(Y_{(m)}) = \frac{m}{(m+1)^2 (m+2)}, \qquad \mathrm{Cov}(Y_{(1)}, Y_{(m)}) = \frac{1}{(m+1)^2 (m+2)}.$$

Therefore

$$E(X_{(1)} + 1) = \theta + \frac{2}{m+1}, \qquad \mathrm{Var}(X_{(1)} + 1) = \frac{4m}{(m+1)^2 (m+2)},$$

$$E(X_{(m)} - 1) = \theta - \frac{2}{m+1}, \qquad \mathrm{Var}(X_{(m)} - 1) = \frac{4m}{(m+1)^2 (m+2)},$$

$$\mathrm{Cov}(X_{(1)} + 1, X_{(m)} - 1) = \frac{4}{(m+1)^2 (m+2)}.$$

It now follows that $T_1 = X_{(1)} + 1$ and $T_2 = X_{(m)} - 1$ are both consistent for $\theta$ with MSE$(T_1)$ = MSE$(T_2) = \dfrac{8}{(m+1)(m+2)}$. Further $T_3 = \dfrac{T_1 + T_2}{2}$ has bias zero and $\mathrm{Var}(T_3) = \dfrac{4m + 4}{2(m+1)^2 (m+2)} = \dfrac{2}{(m+1)(m+2)}$ so that $T_3$ is also consistent for $\theta$.

Expanding MSE$(\bar X_m)$ and MSE$(T_i)$, $i = 1, 2, 3$, in powers of $1/m$ we observe that


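$$\mathrm{MSE}(\bar X_m) = \frac{1}{3m}$$

$$\mathrm{MSE}(T_1) = \mathrm{MSE}(T_2) = \frac{8}{m^2}\left(1 + \frac{1}{m}\right)^{-1}\left(1 + \frac{2}{m}\right)^{-1} = \frac{8}{m^2}\left[1 - \frac{3}{m} + \ldots\right]$$

$$\mathrm{MSE}(T_3) = \frac{2}{m^2}\left(1 + \frac{1}{m}\right)^{-1}\left(1 + \frac{2}{m}\right)^{-1} = \frac{2}{m^2}\left[1 - \frac{3}{m} + \ldots\right]$$

Note that MSE$(\bar X_m) \to 0$ at the rate $\dfrac{1}{3m}$ and MSE$(T_1)$ = MSE$(T_2) \to 0$ at the rate $8/m^2$ whereas MSE$(T_3) \to 0$ at the rate $2/m^2$. Thus the fastest convergence here is for MSE$(T_3)$, and $T_3$ is the best and $\bar X_m$ is the worst among these four estimators.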

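We now consider the estimator $M_m = X_{([m/2]+1)}$, the sample median. Now $M_m = X_{(n+1)}$ if $m = 2n + 1$ or $m = 2n$, but the distributions of $M_m$ in the two cases are different. The case $m = 2n + 1$ has been considered in detail in Example 5.3.4; hence we will consider the case $m = 2n$ only. The pdf of $M_m$ in this case is given by

$$g(u) = \frac{1}{2B(n+1, n)} \left(\frac{u - \theta + 1}{2}\right)^{n} \left(1 - \frac{u - \theta + 1}{2}\right)^{n-1}, \quad \theta - 1 < u < \theta + 1.$$

Again substituting $\dfrac{M_m - \theta + 1}{2} = W$ we have the pdf of $W$ given by

$$g(w) = \frac{1}{B(n+1, n)}\, w^{n} (1 - w)^{n-1}, \quad 0 < w < 1.$$

Thus $E(W) = \dfrac{n+1}{2n+1}$ and $\mathrm{Var}(W) = \dfrac{n}{2(2n+1)^2}$ so that $E(M_m) = \theta + \dfrac{1}{2n+1}$ and

$$\mathrm{Var}(M_m) = 4\,\mathrm{Var}(W) = \frac{2n}{(2n+1)^2}.$$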


Thus

$$\mathrm{MSE}(M_m) = \begin{cases} \dfrac{1}{m+1} = \dfrac{1}{m}\left(1 + \dfrac{1}{m}\right)^{-1} = \dfrac{1}{m}\left[1 - \dfrac{1}{m} + o\!\left(\dfrac{1}{m}\right)\right] & \text{if } m \text{ is even,} \\[2ex] \dfrac{1}{m+2} = \dfrac{1}{m}\left(1 + \dfrac{2}{m}\right)^{-1} = \dfrac{1}{m}\left[1 - \dfrac{2}{m} + o\!\left(\dfrac{1}{m}\right)\right] & \text{if } m \text{ is odd.} \end{cases}$$

However, MSE$(M_m) \to 0$ as $m \to \infty$ and $M_m$ is consistent for $\theta$.

Thus $M_m$ is least efficient, as MSE$(M_m) \to 0$ at the rate of $\dfrac{1}{m}$ whereas MSE$(\bar X_m) \to 0$ at the rate of $\dfrac{1}{3m}$. The following table gives the minimum sample size $m_0$ to achieve accuracy level $\varepsilon = 0.1$ and $\delta = 0.1$ for all the five estimators. This table can be obtained by using the fact that $m_0$ is determined by the inequality MSE$(T_i) \le \delta \varepsilon^2$ and the expansion of MSE$(T_i)$ in powers of $\dfrac{1}{m}$.

Estimator    $\bar X_m$    $M_m$    $T_1$    $T_2$    $T_3$

$m_0$        334           1000     90       90       45
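The table entries can be reproduced by searching for the smallest $m$ whose leading-term MSE does not exceed $\delta\varepsilon^2$. This is a sketch under the assumption that only the leading terms $1/3m$, $1/m$, $8/m^2$ and $2/m^2$ of the expansions are used; the names `min_sample_size` and `LEADING_TERMS` are illustrative.

```python
from fractions import Fraction

def min_sample_size(mse, bound):
    """Smallest m with mse(m) <= bound (mse assumed decreasing in m)."""
    m = 1
    while mse(m) > bound:
        m += 1
    return m

EPS = Fraction(1, 10)
DELTA = Fraction(1, 10)
BOUND = DELTA * EPS ** 2            # Tchebychev condition: MSE <= delta * eps^2

# Leading terms of the MSE expansions in powers of 1/m.
LEADING_TERMS = {
    "mean": lambda m: Fraction(1, 3 * m),
    "median": lambda m: Fraction(1, m),
    "T1": lambda m: Fraction(8, m * m),
    "T2": lambda m: Fraction(8, m * m),
    "T3": lambda m: Fraction(2, m * m),
}

table = {name: min_sample_size(f, BOUND) for name, f in LEADING_TERMS.items()}
print(table)
```

Exact rational arithmetic is used so that boundary cases such as $1/m \le 1/1000$ are not affected by floating-point rounding.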


We have already noted in Example 5.1.1 that the number $n_0(\varepsilon, \delta, \theta)$ given by using Tchebychev inequality could be much higher than the number obtained by using the large sample or asymptotic distribution of the estimator. Let $T$ be a consistent estimator of $\theta$; then we want to estimate

$$p_n(\varepsilon, \theta) = P_\theta[|T - \theta| < \varepsilon] = F_n(\theta + \varepsilon) - F_n(\theta - \varepsilon)$$

assuming $T$ to be a continuous r.v. with continuous distribution function $F_n(u, \theta) = P_\theta[T \le u]$, where $T$ is the estimator based on sample size $n$. As $T \xrightarrow{P} \theta$ we have

$$\lim_{n \to \infty} F_n(u, \theta) = \begin{cases} 0 & \text{if } u < \theta \\ 1 & \text{if } u > \theta \end{cases}$$

and $\lim_{n \to \infty} p_n(\varepsilon, \theta) = 1$ for any $\varepsilon > 0$. Note that convergence in probability and convergence in distribution are equivalent when the limiting distribution function is degenerate. Thus convergence in probability by itself will not provide us any information about the rate at which $p_n(\varepsilon, \theta) \to 1$ as $n \to \infty$.




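To obtain such information about the rate at which $p_n(\varepsilon, \theta) \to 1$ we may obtain an approximation to $p_n(\varepsilon, \theta)$. Recall that for $a_n > 0$, $P[|T - \theta| < \varepsilon] = P[|a_n(T - \theta)| < a_n \varepsilon]$. Thus if we can find a sequence $a_n > 0$, $a_n \to \infty$, such that the limiting distribution of $a_n(T - \theta)$ is non-degenerate, or $Y_n = a_n(T - \theta) \xrightarrow{d} Y$ where $Y$ is a non-degenerate r.v., then $p_n(\varepsilon, \theta) \approx P[|Y| < a_n \varepsilon] = G(a_n \varepsilon) - G(-a_n \varepsilon)$, assuming that $Y$ has a continuous d.f. $G$. For example in case of i.i.d. r.v.s with $E(X_i) = \theta$ and $\mathrm{Var}(X_i) = \sigma^2$, by WLLN $\bar X \xrightarrow{P} \theta$ and by CLT $\sqrt{n}(\bar X - \theta) \xrightarrow{d} N(0, \sigma^2)$, or if $a_n = \sqrt{n}$, $p_n(\varepsilon, \theta) \approx \Phi(\sqrt{n}\,\varepsilon/\sigma) - \Phi(-\sqrt{n}\,\varepsilon/\sigma)$, and the rate at which $p_n(\varepsilon, \theta) \to 1$ can now be studied by known results about the rate at which $\Phi(\sqrt{n}\,\varepsilon/\sigma) \to 1$ and $\Phi(-\sqrt{n}\,\varepsilon/\sigma) \to 0$. For example, for large $a$, it is known that

$$1 - \Phi(a) \approx \frac{\varphi(a)}{a} = \frac{1}{a\sqrt{2\pi}}\, e^{-a^2/2}.$$

Therefore in the above case

$$p_n(\varepsilon, \theta) \approx 2\Phi(\sqrt{n}\,\varepsilon/\sigma) - 1 \approx 1 - \frac{2\sigma}{\varepsilon\sqrt{2\pi n}}\, e^{-n\varepsilon^2/2\sigma^2}$$

and $p_n(\varepsilon, \theta) \to 1$ at the rate at which $\dfrac{1}{\sqrt{n}}\, e^{-n\varepsilon^2/2\sigma^2} \to 0$.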
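The quality of the tail approximation $1 - \Phi(a) \approx \varphi(a)/a$ can be inspected numerically. The sketch below relies on `statistics.NormalDist` from the Python standard library; the function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()           # standard normal N(0, 1)

def tail_exact(a):
    """1 - Phi(a), computed as Phi(-a) to avoid cancellation for large a."""
    return STD_NORMAL.cdf(-a)

def tail_approx(a):
    """Large-a approximation phi(a) / a."""
    return STD_NORMAL.pdf(a) / a

for a in (1.0, 2.0, 3.0, 5.0):
    print(a, tail_exact(a), tail_approx(a), tail_exact(a) / tail_approx(a))
```

The ratio in the last column approaches one as $a$ grows, and the approximation always overstates the tail.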

Observe that by considering $\sqrt{n}(\bar X - \theta)$ rather than $(\bar X - \theta)$ we have a non-degenerate limiting distribution of the sequence of r.v.s under study. Multiplication by the factor $a_n$ in general and $\sqrt{n}$ in particular blows the infinitesimal r.v. $(\bar X - \theta)$ to $\sqrt{n}(\bar X - \theta)$, a non-infinitesimal r.v. This is similar to studying, with the help of a microscope, a phenomenon that cannot be studied by naked eyes, or to studying a map obtained by an airborne camera by enlarging it. Note that we have to blow the picture just enough. Thus in this case $a_n = \sqrt{n}$ is just right, as $n^\delta(\bar X - \theta) \xrightarrow{P} 0$ if $0 < \delta < 1/2$ and $n^\delta(\bar X - \theta)$ will diverge to infinity if $\delta > 1/2$. It is only for $\delta = 1/2$ that the r.v. $(\bar X - \theta)$ is blown just enough so that $n^\delta(\bar X - \theta) \xrightarrow{d} N(0, \sigma^2)$, a proper non-degenerate r.v. We illustrate this magnifying operation by a few examples and in the next chapter consider this approach in detail.

Example 5.4.3 Consider a r.s. of size $m$ from $U(\theta - 1, \theta + 1)$ and the five estimators considered in Example 5.4.2. We now consider the asymptotic distributions of each one of these, i.e. use the limiting distribution of $b_m^{(1)}(\bar X_m - \theta)$, $b_m^{(2)}(X_{([m/2]+1)} - \theta)$ and $a_m^{(i)}(T_i - \theta)$, $i = 1, 2, 3$, where the constants $a_m^{(i)}$ and $b_m^{(1)}$, $b_m^{(2)}$ are chosen appropriately. A quick rule of thumb for the choice of such norming constants is to use them as inversely proportional to the square root of the MSE. Thus for $\bar X_m$ and $X_{([m/2]+1)}$ we use $\sqrt{m}$. Now by CLT $\sqrt{3m}(\bar X_m - \theta) \xrightarrow{d} N(0, 1)$, i.e. $\bar X_m$ is Asymptotic Normal with mean $\theta$ and variance $1/3m$, denoted by $\bar X_m \sim AN(\theta, 1/3m)$. Similarly the theorem on the asymptotic distribution of the 100p% sample percentile states that, provided $f(\xi_p) \ne 0$,

$$X_{([mp]+1)} \sim AN\!\left(\xi_p, \frac{p(1 - p)}{m[f(\xi_p)]^2}\right).$$

For proof refer to David (1981). For $p = 1/2$ we have $M_m = X_{([m/2]+1)} \sim AN(\theta, 1/m)$ as $f(\xi_{1/2}) = 1/2$ in the $U(\theta - 1, \theta + 1)$ case. Thus by using asymptotic distributions of $\bar X_m$, $M_m$, we get $m_0(\varepsilon, \delta, \theta)$ as $\left[[\Phi^{-1}(1 - \delta/2)]^2 / 3\varepsilon^2\right] + 1$ and $\left[[\Phi^{-1}(1 - \delta/2)]^2 / \varepsilon^2\right] + 1$ respectively. For $\varepsilon = 0.1$, $\delta = 0.1$ we have $m_0(\bar X_m) = 91$ and $m_0(M_m) = 271$.
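The two numbers follow from the normal approximation $T \sim AN(\theta, v/m)$, which gives $m_0 = [\,v\,[\Phi^{-1}(1 - \delta/2)]^2/\varepsilon^2\,] + 1$ with $v = 1/3$ for the mean and $v = 1$ for the median. A minimal sketch, where `m0_normal` is an illustrative name:

```python
import math
from statistics import NormalDist

def m0_normal(v, eps, delta):
    """Minimum m when T ~ AN(theta, v/m): [v z^2 / eps^2] + 1, z = Phi^{-1}(1 - delta/2)."""
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)
    return math.floor(v * z * z / eps ** 2) + 1

print(m0_normal(1 / 3, eps=0.1, delta=0.1))   # sample mean,   v = 1/3
print(m0_normal(1.0, eps=0.1, delta=0.1))     # sample median, v = 1
```

Both values are far below the Tchebychev-based figures of 334 and 1000 for the same $(\varepsilon, \delta)$.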
For $T_1$, $T_2$ we can use the exact distribution for each $m$. Consider $T_1 - \theta = (X_{(1)} + 1 - \theta)$. Observe that $T_1 - \theta = 2Y_{(1)}$ where $Y_{(1)} = (X_{(1)} + 1 - \theta)/2$ has pdf $g_1(y) = m(1 - y)^{m-1}$, $0 < y < 1$. Since $T_1 = X_{(1)} + 1 > \theta$, we have $P[|T_1 - \theta| < \varepsilon] = P[0 < Y_{(1)} < \varepsilon/2] = 1 - (1 - \varepsilon/2)^m$ if $0 < \varepsilon < 2$ and one otherwise. Thus $m_0(T_1)$ is such that

$$1 - (1 - \varepsilon/2)^m \ge 1 - \delta \quad \text{or} \quad m_0(T_1) = \left[\frac{\log \delta}{\log(1 - \varepsilon/2)}\right] + 1 \quad \text{if } 0 < \varepsilon < 2,$$

otherwise $m_0(T_1) = 1$. For $\delta = \varepsilon = .1$ we obtain $m_0(T_1) = 45$. Similarly for $T_2 - \theta = X_{(m)} - 1 - \theta$, since $T_2 - \theta < 0$ we have $|T_2 - \theta| = \theta - T_2 = \theta - (X_{(m)} - 1) = \theta + 1 - X_{(m)}$ and

$$P[|T_2 - \theta| < \varepsilon] = P[\theta - (X_{(m)} - 1) < \varepsilon]$$

$$= P[X_{(m)} > \theta + 1 - \varepsilon]$$

$$= P[2Y_{(m)} + \theta - 1 > \theta + 1 - \varepsilon]$$

$$= P[Y_{(m)} > 1 - \varepsilon/2]$$

$$= 1 - (1 - \varepsilon/2)^m \quad \text{if } 0 < \varepsilon < 2$$

$$= 1 \quad \text{otherwise.}$$

If $\varepsilon < 2$, $m_0(T_2)$ is given by $1 - (1 - \varepsilon/2)^m \ge 1 - \delta$ or $m_0(T_2) = \left[\dfrac{\log \delta}{\log(1 - \varepsilon/2)}\right] + 1$, which for $\varepsilon = 0.1$, $\delta = 0.1$ gives $m_0(T_2) = 45$, as one would expect. For $\varepsilon \ge 2$ we have $m_0(T_2) = 1$. It would be interesting to consider the asymptotic distribution of $a_m(T_1 - \theta)$ and $a_m(T_2 - \theta)$ where $a_m$ is chosen such that the limiting distributions are non-degenerate. Since $E(T_1 - \theta)^2$ is of the order $1/m^2$ one could consider $a_m = m$, and indeed if $Z_m = m(T_1 - \theta)$ then $0 < Z_m < 2m$ and the d.f. of $Z_m$ is

$$G_m(u) = \begin{cases} 1 - (1 - u/2m)^m & \text{if } 0 < u < 2m \\ 1 & \text{if } u \ge 2m. \end{cases}$$

To consider $\lim_{m \to \infty} G_m(u)$ for fixed $u$, we observe that for given $u$ there exists an $m'$ such that $0 < u < 2m$ and $G_m(u) = 1 - (1 - u/2m)^m$ for $m \ge m'$, and therefore $\lim_{m \to \infty} G_m(u) = 1 - e^{-u/2}$, $u > 0$. Thus $m(T_1 - \theta)$ is asymptotically exponential with mean 2. Hence $P[|T_1 - \theta| < \varepsilon] = P[0 < Z_m < m\varepsilon]$ is approximately equal to $1 - e^{-m\varepsilon/2}$. Thus $m_0$ based on the asymptotic distribution of $m(T_1 - \theta)$ is determined by the condition that $1 - e^{-m\varepsilon/2} \ge 1 - \delta$, or $m_0' = [-2 \log \delta/\varepsilon] + 1$. For $\varepsilon = .1$ and $\delta = .1$ we obtain $m_0'(T_1) = 47$, which is quite close to $m_0(T_1) = 45$ obtained by using the exact distribution of $T_1$. Note that as $\log(1 - \varepsilon/2) = -\varepsilon/2 + o(\varepsilon/2)$, the values of $m_0(T_1)$ and $m_0'(T_1)$ are bound to be quite close.
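The exact and asymptotic sample sizes for the extreme-based estimators can be computed side by side. The sketch below assumes $0 < \varepsilon < 2$ and $0 < \delta < 1$; the function names are hypothetical.

```python
import math

def m0_extreme_exact(eps, delta):
    """Smallest m with 1 - (1 - eps/2)^m >= 1 - delta (exact law of T1 or T2)."""
    return math.floor(math.log(delta) / math.log(1.0 - eps / 2.0)) + 1

def m0_extreme_asymptotic(eps, delta):
    """Same quantity from the limit law: m (T1 - theta) -> exponential with mean 2."""
    return math.floor(-2.0 * math.log(delta) / eps) + 1

print(m0_extreme_exact(0.1, 0.1), m0_extreme_asymptotic(0.1, 0.1))
```

The small gap between the two reflects the closeness of $\log(1 - \varepsilon/2)$ to $-\varepsilon/2$ for small $\varepsilon$.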

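Now consider $T_3 = \dfrac{T_1 + T_2}{2} = \dfrac{X_{(1)} + 1 + X_{(m)} - 1}{2}$ so that $T_3 = \dfrac{X_{(1)} + X_{(m)}}{2}$ and $T_3 - \theta = \dfrac{X_{(1)} - \theta + X_{(m)} - \theta}{2}$. In terms of $Y_{(i)} = \dfrac{X_{(i)} - \theta + 1}{2}$ we have $T_3 - \theta = [Y_{(1)} + Y_{(m)} - 1]$ and

$$P[|T_3 - \theta| < \varepsilon] = P[1 - \varepsilon < Y_{(1)} + Y_{(m)} < 1 + \varepsilon] = \iint_A m(m - 1)(y_{(m)} - y_{(1)})^{m-2}\, dy_{(1)}\, dy_{(m)}$$

where $A$ is the shaded region in Fig. 5.1.

Fig. 5.1 Evaluating $P(1 - \varepsilon < y_{(1)} + y_{(m)} < 1 + \varepsilon)$

Therefore,

$$\text{Required probability} = 1 - \iint_{A_1} g(y_{(1)}, y_{(m)})\, dy_{(1)}\, dy_{(m)} - \iint_{A_2} g(y_{(1)}, y_{(m)})\, dy_{(1)}\, dy_{(m)}$$

where $A_1$ is the triangle formed by $(0, 0)$, $\left(\dfrac{1 - \varepsilon}{2}, \dfrac{1 - \varepsilon}{2}\right)$ and $(0, 1 - \varepsilon)$, and $A_2$ the triangle formed by $(1, 1)$, $\left(\dfrac{1 + \varepsilon}{2}, \dfrac{1 + \varepsilon}{2}\right)$ and $(\varepsilon, 1)$. Now

$$\iint_{A_1} g(y_{(1)}, y_{(m)})\, dy_{(1)}\, dy_{(m)} = \iint_{A_2} g(y_{(1)}, y_{(m)})\, dy_{(1)}\, dy_{(m)} = \frac{(1 - \varepsilon)^m}{2}$$

as can be seen by integrating the pdf over $A_1$ and $A_2$. Therefore

$$P[1 - \varepsilon < Y_{(1)} + Y_{(m)} < 1 + \varepsilon] = 1 - (1 - \varepsilon)^m.$$

Thus $m_0(T_3)$ is given by $1 - (1 - \varepsilon)^m \ge 1 - \delta$ or $m_0(T_3) = \left[\dfrac{\log \delta}{\log(1 - \varepsilon)}\right] + 1$. For $\varepsilon = 0.1$ and $\delta = 0.1$, $m_0(T_3) = 22$.

To obtain the asymptotic distribution of $m(T_3 - \theta)$ we see that

$$P[|m(T_3 - \theta)| < u] = P[|T_3 - \theta| < u/m] = 1 - (1 - u/m)^m$$

and $\lim_{m \to \infty} P[|m(T_3 - \theta)| < u] = 1 - e^{-u}$. Therefore for large $m$, $P[|T_3 - \theta| < \varepsilon] = P[|m(T_3 - \theta)| < m\varepsilon] \approx 1 - e^{-m\varepsilon}$ and $m_0(T_3)$ based on the asymptotic distribution of $T_3$ is given by $m_0'(T_3) = \left[\dfrac{-\log \delta}{\varepsilon}\right] + 1$; for $\varepsilon = 0.1$, $\delta = 0.1$ we have $m_0'(T_3) = 24$. This is quite close to the exact value of $m_0(T_3) = 22$ as we have $\log(1 - \varepsilon) = -\varepsilon + o(\varepsilon)$.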
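The closed form $P[|T_3 - \theta| < \varepsilon] = 1 - (1 - \varepsilon)^m$ can be checked by simulation. This is only a Monte Carlo sketch; the function name `coverage_t3`, the seed and the number of replications are assumptions.

```python
import random

def coverage_t3(m, eps, reps, rng, theta=0.0):
    """Monte Carlo estimate of P[|T3 - theta| < eps], T3 = (X_(1) + X_(m)) / 2."""
    hits = 0
    for _ in range(reps):
        xs = [rng.uniform(theta - 1.0, theta + 1.0) for _ in range(m)]
        t3 = (min(xs) + max(xs)) / 2.0       # midrange of the sample
        hits += abs(t3 - theta) < eps
    return hits / reps

rng = random.Random(4)
m, eps = 22, 0.1
print("simulated  :", coverage_t3(m, eps, reps=20_000, rng=rng))
print("closed form:", 1 - (1 - eps) ** m)
```

At $m = 22$ the probability is just above $1 - \delta = 0.9$, consistent with $m_0(T_3) = 22$.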

The other interesting way to obtain this result is to observe that $mY_{(1)} \xrightarrow{d} Z_1$ and $m(1 - Y_{(m)}) \xrightarrow{d} Z_2$ where $Z_1$, $Z_2$ are standard exponentials. Further $Y_{(1)}$ and $Y_{(m)}$ can be shown to be asymptotically independent. Hence $P[|Y_{(1)} + Y_{(m)} - 1| < \varepsilon] \approx P[|Z_1 - Z_2| < m\varepsilon]$ where $Z_1$, $Z_2$ are independent exponentials (see David, 1981). Now $P[|Z_1 - Z_2| < m\varepsilon]$ can be shown to be equal to $1 - e^{-m\varepsilon}$ and $m_0'(T_3) = \left[\dfrac{-\log \delta}{\varepsilon}\right] + 1$. Table 2 summarizes the results for all the five estimators for $\varepsilon = .1$ and $\delta = .1$.


Table 2

Minimum sample size                  Estimator
                                     $\bar X_m$    $M_m$    $T_1$    $T_2$    $T_3$

Based on MSE                         334           1000     90       90       45

Based on exact distribution          NA            NA       45       45       22

Based on asymptotic distribution     91            271      47       47       24


Exercise 5.4 (1) Let $(X_1, \ldots, X_m)$ be i.i.d. $U(0, \theta)$. Carry out a similar exercise as in Example 5.4.3 for consistent estimation of $\theta$ using $2\bar X_m$, $M_m$, $T_2 = X_{(m)}$, $T_3 = \dfrac{m+1}{m} X_{(m)}$. Note that here the minimum number $m_0$ depends on $\theta$ in each case. Calculate $m_0$ for $\theta = 1/2, 1, 2, 4$.

(2) Let $(X_1, \ldots, X_n)$ be i.i.d. $N(\theta, \sigma^2)$ and let $S^2 = \sum (X_i - \bar X)^2 \sim \sigma^2 \chi^2_{n-1}$. Obtain a general expression for $Q$ = MSE$(c_n S^2)$ and show that Min $Q$ occurs at $c_n = \dfrac{1}{n+1}$ and

$$\mathrm{MSE}\!\left(\frac{S^2}{n+1}\right) < \mathrm{MSE}\!\left(\frac{S^2}{n}\right) < \mathrm{MSE}\!\left(\frac{S^2}{n-1}\right).$$

(3) Use the well known normal approximation to $\chi^2_k$ given by $\chi^2_k \sim AN(k, 2k)$ to determine $n_0(\varepsilon, \delta, \sigma^2)$ for

$$T_1 = \frac{S^2}{n-1}, \quad T_2 = \frac{S^2}{n}, \quad T_3 = \frac{S^2}{n+1}$$

as estimators of $\sigma^2$.

(4) For the double exponential distribution with Mean = Median = $\theta$, use CLT to obtain the asymptotic distribution of $\bar X_n$ and the theorem quoted in Example 5.4.3 to obtain the asymptotic distribution of $M_n$. Determine $n_0(\varepsilon, \delta, \theta)$ for $\bar X_n$ and $M_n$.


Consistent Asymptotically Normal Estimators


6.1 A Basic Result

Let $T$ be consistent for a real parameter $\theta$ so that $(T - \theta) \xrightarrow{P} 0$, $\forall\, \theta \in \Omega$. We have seen that in order to obtain an approximation to $P[|T - \theta| < \varepsilon]$ for large $n$, we blow up the infinitesimal r.v. $(T - \theta)$ by a factor $a_n$ so that $a_n(T - \theta) \xrightarrow{d} Z$ where $Z$ is a non-degenerate r.v. We saw by way of examples that for a random sample of size $n$, or $\{X_i\}_1^n$ i.i.d., $a_n = \sqrt{n}$ is adequate for estimators obtained by the method of moments or percentiles and $Z$ is $N(0, \sigma_T^2(\theta))$ so that $T \sim AN(\theta, \sigma_T^2(\theta)/n)$. For extremes such as $X_{(n)}$ or $X_{(1)}$ the blow up factor was $n$ and the limiting distribution was non-normal, typically exponential with mean $\sigma(\theta)$. When the limiting distribution is normal, the minimum sample size $n_0(\varepsilon, \delta, \theta)$ required for a consistent estimator $T$ to achieve the degree of accuracy specified by $(\varepsilon, \delta)$ is given by the smallest $n$ such that $a_n \ge \sigma_T(\theta)\, \Phi^{-1}(1 - \delta/2)/\varepsilon$. If $T'$ is any other consistent estimator such that $T' \sim AN(\theta, \sigma_{T'}^2(\theta)/a_n^2)$, i.e. $T'$ has the same blow up factor $a_n$, then the corresponding number $n_0'(\varepsilon, \delta, \theta)$ is given by the smallest $n$ such that $a_n \ge \sigma_{T'}(\theta)\, \Phi^{-1}(1 - \delta/2)/\varepsilon$. We then would prefer $T'$ to $T$ if $\sigma_{T'}^2(\theta) < \sigma_T^2(\theta)$, $\forall\, \theta \in \Omega$. The number $\sigma_{T'}^2(\theta)/\sigma_T^2(\theta)$ is called the asymptotic relative efficiency (ARE) of $T$ with respect to $T'$. If ARE$(T, T') = \sigma_{T'}^2(\theta)/\sigma_T^2(\theta) < 1$ $\forall\, \theta \in \Omega$ then we prefer $T'$ to $T$. If the sequence of blow up factors is not the same, i.e. $T \sim AN(\theta, \sigma_T^2(\theta)/a_n^2)$ and $T' \sim AN(\theta, \sigma_{T'}^2(\theta)/b_n^2)$, then we choose $T$ over $T'$ if $n_0(\varepsilon, \delta, \theta) < n_0'(\varepsilon, \delta, \theta)$, $\forall\, \theta \in \Omega$, and this would happen if $\sigma_T^2(\theta)/a_n^2 < \sigma_{T'}^2(\theta)/b_n^2$. The ARE of $T$ w.r.t. $T'$ is then defined as the limit of the ratio $\sigma_{T'}^2(\theta)\, a_n^2 / \sigma_T^2(\theta)\, b_n^2$. If the limiting distributions after the use of appropriate blow up factors are normal, the choice between consistent estimators is governed by their asymptotic variances. We therefore first concentrate on those consistent estimators which are such that their blown up versions satisfy $a_n(T - \theta) \xrightarrow{d} N(0, \sigma_T^2(\theta))$ or $T \sim AN(\theta, \sigma_T^2(\theta)/a_n^2)$. Such an estimator is called a Consistent Asymptotically Normal or CAN estimator.

We have seen in the previous chapter that if $T$ is consistent and if $\psi$ is a continuous function then $\psi(T)$ is consistent for $\psi(\theta)$. Similarly we have an invariance property for CAN estimators under a continuous differentiable transformation $\psi$.



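THEOREM 6.1.1 Let $T$ be CAN for $\theta$ so that $T \sim AN(\theta, \sigma_T^2(\theta)/a_n^2)$ and let $\psi$ be a differentiable function such that $\dfrac{d\psi}{d\theta}$ is continuous and nonvanishing; then $\psi(T)$ is CAN for $\psi(\theta)$ and

$$\psi(T) \sim AN\!\left(\psi(\theta),\ \sigma_T^2(\theta) \left(\frac{d\psi}{d\theta}\right)^2 \Big/ a_n^2\right).$$

Since $\psi$ is differentiable, it is continuous and $\psi(T)$ is consistent for $\psi(\theta)$. Now by the mean value theorem

$$\psi(T) - \psi(\theta) = (T - \theta)\frac{d\psi}{d\theta} + R$$

where the remainder $R$ is such that $|R| \le M|T - \theta|^{1+\delta}$ for some $\delta > 0$. Now $a_n(\psi(T) - \psi(\theta)) = a_n(T - \theta)\dfrac{d\psi}{d\theta} + R a_n$ and the theorem is proved if we show that $|R a_n| \xrightarrow{P} 0$. Note that

$$a_n(T - \theta)\frac{d\psi}{d\theta} \xrightarrow{d} N\!\left(0,\ \sigma_T^2(\theta)\left(\frac{d\psi}{d\theta}\right)^2\right)$$

since $a_n(T - \theta) \xrightarrow{d} N(0, \sigma_T^2(\theta))$. Now $|a_n R| \le M|a_n(T - \theta)|\,|T - \theta|^\delta$ and $a_n(T - \theta) \xrightarrow{d} N(0, \sigma_T^2(\theta))$, a proper r.v., and $|T - \theta|^\delta \xrightarrow{P} 0$ so that $|a_n R| \xrightarrow{P} 0$. Here we are using the result that if $U_n \xrightarrow{d} U$ and $V_n \xrightarrow{P} 0$ then $U_n V_n \xrightarrow{P} 0$. For proof we refer to Rao (1973) or Cramer (1946).

We now consider a few examples to illustrate the invariance of CAN estimators.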


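Example 6.1.1 Let $(X_1, \ldots, X_n)$ be i.i.d. Poisson with mean $\lambda$; then by CLT, $\bar X \sim AN(\lambda, \lambda/n)$ and thus the MVUE $\bar X$ is CAN for $\lambda$. Consider estimation of $\psi(\lambda) = e^{-\lambda} = P(X = 0)$. Then $\dfrac{d\psi}{d\lambda} = -e^{-\lambda} < 0$, $\forall\, \lambda > 0$, and is continuous. Hence by the invariance property $\psi(\bar X) = e^{-\bar X} \sim AN\!\left(e^{-\lambda}, \dfrac{\lambda e^{-2\lambda}}{n}\right)$. Note that AV$(e^{-\bar X}) = \dfrac{\lambda e^{-2\lambda}}{n}$ = CRLB for estimating $\psi(\lambda) = e^{-\lambda}$. We have already seen that the MVUE of $e^{-\lambda}$, given by $\hat\psi(\bar X) = \left(\dfrac{n-1}{n}\right)^{n\bar X}$, has variance strictly greater than the CRLB. Next consider estimating $\psi(\lambda) = \lambda e^{-\lambda} = P[X = 1]$. Here $\dfrac{d\psi}{d\lambda} = e^{-\lambda}(1 - \lambda) \ne 0$, $\lambda \ne 1$. Thus $\psi(\bar X) = \bar X e^{-\bar X}$ is CAN for $\lambda e^{-\lambda}$ with asymptotic variance $\dfrac{\lambda}{n} e^{-2\lambda}(1 - \lambda)^2$ for any $\lambda \ne 1$. At $\lambda = 1$ the CAN-ness of $\psi(\bar X)$ may not hold.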
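The asymptotic variance given by the invariance property, $\sigma_T^2(\theta)(d\psi/d\theta)^2/a_n^2$ with $a_n = \sqrt{n}$, is simple to evaluate. The sketch below applies it to the two Poisson functions of this example; `delta_method_avar` is an illustrative name and the derivatives are supplied by hand.

```python
import math

def delta_method_avar(dpsi, theta, sigma2, n):
    """Asymptotic variance of psi(T) when T ~ AN(theta, sigma2 / n)."""
    return sigma2 * dpsi(theta) ** 2 / n

lam, n = 2.0, 50

# psi(lambda) = exp(-lambda): derivative is -exp(-lambda).
av_p0 = delta_method_avar(lambda t: -math.exp(-t), lam, sigma2=lam, n=n)

# psi(lambda) = lambda exp(-lambda): derivative is exp(-lambda) (1 - lambda).
dpsi_p1 = lambda t: math.exp(-t) * (1.0 - t)
av_p1 = delta_method_avar(dpsi_p1, lam, sigma2=lam, n=n)

print(av_p0, lam * math.exp(-2 * lam) / n)
print(av_p1, lam * (1 - lam) ** 2 * math.exp(-2 * lam) / n)
print(delta_method_avar(dpsi_p1, 1.0, sigma2=1.0, n=n))   # vanishes at lambda = 1
```

The zero in the last line signals the breakdown of the first-order result at $\lambda = 1$, where the derivative vanishes.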


Example 6.1.2 Let $(X_1, \ldots, X_n)$ be i.i.d. $N(\mu, 1)$; then $\bar X \sim N(\mu, 1/n)$ and is CAN for $\mu$. Consider $\psi(\mu) = \mu^2$. Then $\dfrac{d\psi}{d\mu} = 2\mu \ne 0$ for any $\mu$ except $\mu = 0$. Thus $\bar X^2$ is CAN for $\mu^2$ for all $\mu \ne 0$ and AV$(\bar X^2) = \dfrac{4\mu^2}{n}$ = CRLB for estimating $\mu^2$. Consider $\sqrt{n}(\bar X^2 - \mu^2)$ at $\mu = 0$, i.e. $Z = \sqrt{n}\,\bar X^2$. Then as $\bar X \sim N(0, 1/n)$ under $\mu = 0$, $n\bar X^2 \sim \chi_1^2$ $\forall\, n \ge 1$ and thus $n\bar X^2 \xrightarrow{d} \chi_1^2$ at $\mu = 0$. Therefore $\sqrt{n}(\bar X^2 - \mu^2)$ at $\mu = 0$ behaves like $\chi_1^2/\sqrt{n}$, which converges in probability to 0. Therefore here the inflating factor would be $a_n = n$ rather than $a_n = \sqrt{n}$, and the asymptotic distribution of $n(\bar X^2 - \mu^2)$ is $\chi_1^2$ at $\mu = 0$, which is the same as the exact distribution. This example illustrates that the sequence $a_n$ need not always be $\sqrt{n}$ and the limiting distribution is not always normal.
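A short simulation illustrates the two scalings at $\mu = 0$. It draws $\bar X$ directly from its exact law $N(0, 1/n)$ rather than averaging $n$ observations; that shortcut, the function name `mean_scaled_square` and the simulation sizes are assumptions of the sketch.

```python
import math
import random

def mean_scaled_square(n, power, reps, rng):
    """Average of n^power * Xbar^2 over reps draws of Xbar ~ N(0, 1/n)."""
    total = 0.0
    for _ in range(reps):
        xbar = rng.gauss(0.0, 1.0 / math.sqrt(n))
        total += n ** power * xbar * xbar
    return total / reps

rng = random.Random(6)
for n in (10, 1_000, 100_000):
    # With a_n = n the mean stays near E(chi^2_1) = 1; with sqrt(n) it shrinks to 0.
    print(n, mean_scaled_square(n, 1.0, 20_000, rng),
          mean_scaled_square(n, 0.5, 20_000, rng))
```

The first column of averages stays near one for every $n$, while the second decays like $1/\sqrt{n}$.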

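The phenomenon observed in the above example can be generalized and stated as a theorem as follows.

THEOREM 6.1.2 Assume the framework of Theorem 6.1.1 but suppose that $\dfrac{d\psi}{d\theta} = 0$ and $\dfrac{d^2\psi}{d\theta^2} \ne 0$ at $\theta = \theta_0$ and is continuous; then

$$a_n^2(\psi(T) - \psi(\theta_0)) \xrightarrow{d} \frac{\sigma_T^2(\theta_0)}{2} \left(\frac{d^2\psi}{d\theta^2}\right)_{\theta = \theta_0} \chi_1^2.$$

By Taylor series expansion of $\psi(T)$ around $\psi(\theta_0)$ we have

$$a_n^2(\psi(T) - \psi(\theta_0)) = \frac{a_n^2 (T - \theta_0)^2}{2} \left(\frac{d^2\psi}{d\theta^2}\right)_{\theta = \theta_0} + a_n^2 R$$

where $|a_n^2 R| \le M a_n^2 (T - \theta_0)^2 |T - \theta_0|^\delta$ for $\delta > 0$.

Now $a_n(T - \theta_0) \xrightarrow{d} N(0, \sigma_T^2(\theta_0))$; therefore

$$\frac{a_n^2 (T - \theta_0)^2}{2} \xrightarrow{d} \frac{\sigma_T^2(\theta_0)}{2}\, \chi_1^2 \quad \text{and} \quad a_n^2 R \xrightarrow{P} 0 \ \text{ since } |T - \theta_0|^\delta \xrightarrow{P} 0.$$

Hence

$$a_n^2(\psi(T) - \psi(\theta_0)) \xrightarrow{d} \frac{\sigma_T^2(\theta_0)}{2} \left(\frac{d^2\psi}{d\theta^2}\right)_{\theta = \theta_0} \chi_1^2.$$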

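Example 6.1.1 (Contd.)

For $\psi(\lambda) = \lambda e^{-\lambda}$ we have for any $\lambda \ne 1$, $\sqrt{n}(\bar X e^{-\bar X} - \lambda e^{-\lambda}) \xrightarrow{d} N(0, \lambda(1 - \lambda)^2 e^{-2\lambda})$ as $\dfrac{d\psi}{d\lambda} = e^{-\lambda}(1 - \lambda)$. However at $\lambda = 1$, $\dfrac{d\psi}{d\lambda} = 0$ but $\left(\dfrac{d^2\psi}{d\lambda^2}\right)_{\lambda = 1} = -e^{-1}$. Therefore, at $\lambda = 1$, $n(\bar X e^{-\bar X} - e^{-1}) \xrightarrow{d} \frac{1}{2}(-e^{-1})\chi_1^2$.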

The selection of the norming constant $a_n$ so that $a_n(T - \theta) \xrightarrow{d} N(0, \sigma_T^2(\theta))$ is a tricky job. We have already seen that for $(X_1, \ldots, X_n)$ i.i.d. $U(0, \theta)$, $T = X_{(n)}$ is a consistent estimator for $\theta$, and $n(X_{(n)} - \theta) \xrightarrow{d} -Y$ where $Y$ is exponential with mean $\theta$. However there is no choice of a norming sequence $a_n$ for which $a_n(X_{(n)} - \theta)$ would converge to a normal r.v. This follows from the observation that no matter how we choose $a_n$, $P[a_n(X_{(n)} - \theta) \le u] = 1$ for $u > 0$, and the limiting distribution of $a_n(X_{(n)} - \theta)$, say $G(u)$, is such that $G(u) = 1$ for $u > 0$ and $G(u) \ne \Phi\!\left(\dfrac{u}{\sigma}\right)$ for $u \in (0, \infty)$ for any $\sigma > 0$. In case observations are on r.v.s which are not i.i.d., the behaviour of the limiting distribution of $a_n(T - \theta)$ can be very complex. We illustrate this phenomenon by way of an example where observations are independent but not identically distributed. When dependence is introduced in the sequence $\{X_i\}_1^\infty$ the behaviour of $a_n(T - \theta)$ can be still more complex. However we will not pursue this problem in this introductory course.


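Example 6.1.3 Let $(X_1, \ldots, X_n)$ be independent r.v.s where $X_i = \theta + e_i$ with $e_i \sim N(0, i^{2\alpha})$, where $\alpha$ is a known constant. Now $E(X_i) = \theta$, and $E(\bar X_n) = \theta$ and $\bar X \sim N\!\left(\theta, \dfrac{1}{n^2}\sum_{i=1}^{n} i^{2\alpha}\right)$. Let $b_n^2 = \dfrac{1}{n^2}\sum_{i=1}^{n} i^{2\alpha}$. Now

$$P[|\bar X - \theta| < \varepsilon] = P\!\left[\frac{|\bar X - \theta|}{b_n} < \frac{\varepsilon}{b_n}\right] = 2\Phi\!\left(\frac{\varepsilon}{b_n}\right) - 1$$

where $b_n > 0$ is the positive square root of $b_n^2$. Now

$$b_n^2 = \frac{1}{n^2}\sum_{i=1}^{n} i^{2\alpha} = n^{2\alpha - 1} \cdot \frac{1}{n}\sum_{i=1}^{n} (i/n)^{2\alpha} \approx \frac{n^{2\alpha - 1}}{2\alpha + 1} \quad \text{for large } n,$$

as $\lim_{n \to \infty} \dfrac{1}{n}\sum_{i=1}^{n} (i/n)^{2\alpha} = \int_0^1 x^{2\alpha}\,dx = \dfrac{1}{2\alpha + 1}$. Thus for $\alpha < \frac{1}{2}$, $b_n \to 0$ and $P[|\bar X - \theta| < \varepsilon] \to 1$ as $n \to \infty$. For $\alpha = 1/2$, $\lim_{n \to \infty} P[|\bar X - \theta| < \varepsilon] = \lim_{n \to \infty} 2\Phi(\varepsilon/b_n) - 1 = 2\Phi(\sqrt{2}\,\varepsilon) - 1$ as $\lim_{n \to \infty} b_n = \dfrac{1}{\sqrt{2}}$. For $\alpha > 1/2$, $b_n \to \infty$ as $n \to \infty$ and $2\Phi\!\left(\dfrac{\varepsilon}{b_n}\right) - 1 \to 0$ as $n \to \infty$, and $\lim_{n \to \infty} P[|\bar X - \theta| < \varepsilon] = 0$.

Thus $\bar X$ is CAN for $\theta$ only when $\alpha < \frac{1}{2}$. For $\alpha > \frac{1}{2}$, $\bar X$ is $N(\theta, b_n^2)$ but $\bar X$ is not consistent. This is because in case $\alpha > \frac{1}{2}$ the entire probability mass of the distribution of $\bar X$ escapes to infinity and the sequence of d.f.s of $\bar X$ does not converge to a d.f. For $\alpha = 1/2$, $\bar X \sim N\!\left(\theta, \dfrac{1}{2} + \dfrac{1}{2n}\right)$ and $\bar X \xrightarrow{d} N\!\left(\theta, \dfrac{1}{2}\right)$. Note that $(\bar X - \theta)/b_n$ for each $n \ge 1$ is $N(0, 1)$ and therefore $(\bar X - \theta)/b_n \xrightarrow{d} N(0, 1)$ for any $\alpha$ as $n \to \infty$, but the behaviour of $P[|\bar X - \theta| < \varepsilon]$ differs for different values of $\alpha$ and depends on whether $b_n \to 0$ or $b_n \to \dfrac{1}{\sqrt{2}}$ or $b_n \to \infty$ according as $\alpha < \frac{1}{2}$, $\alpha = \frac{1}{2}$ and $\alpha > \frac{1}{2}$, respectively.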
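The three regimes can be seen by evaluating $b_n^2$ and $P[|\bar X - \theta| < \varepsilon] = 2\Phi(\varepsilon/b_n) - 1$ directly. A minimal sketch; the function names and the choice $\varepsilon = 0.5$ are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def bn_squared(n, alpha):
    """b_n^2 = (1/n^2) * sum_{i=1}^n i^(2 alpha), the exact variance of Xbar."""
    return sum(i ** (2.0 * alpha) for i in range(1, n + 1)) / n ** 2

def coverage(n, alpha, eps):
    """P[|Xbar - theta| < eps] = 2 Phi(eps / b_n) - 1."""
    b_n = math.sqrt(bn_squared(n, alpha))
    return 2.0 * NormalDist().cdf(eps / b_n) - 1.0

for alpha in (0.25, 0.5, 1.0):
    print(alpha, [round(coverage(n, alpha, eps=0.5), 4) for n in (10, 1_000, 100_000)])
```

For $\alpha = 0.25$ the probabilities climb to one, for $\alpha = 0.5$ they level off at $2\Phi(\sqrt{2}\,\varepsilon) - 1$, and for $\alpha = 1$ they fall to zero.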

In this course we will mostly be concerned with the case of random samples, i.e. $(X_1, \ldots, X_n)$ are i.i.d., and the choice of $a_n = \sqrt{n}$ would be appropriate in most situations. We next consider the method of moments and percentiles to generate CAN estimators.

Exercise 6.1 (1) Let $(X_1, \ldots, X_n)$ be i.i.d. exponential with mean $\theta$. Using CLT show that $\bar X$ is CAN for $\theta$ and obtain a CAN estimator for $P[X > t] = e^{-t/\theta}$ and its asymptotic variance.

(2) Let $(X_1, \ldots, X_n)$ be i.i.d. $b(1, \theta)$. Show that $\bar X$ is CAN for $\theta$. Let $\psi(\theta) = \theta(1 - \theta)$. Show that $\bar X(1 - \bar X)$ is CAN for $\theta(1 - \theta)$ for all values of $\theta$ except $\theta = 1/2$. What is the asymptotic distribution of $\bar X(1 - \bar X)$ at $\theta = \frac{1}{2}$?


6.2 Method of Moments and Percentiles 

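Let $(X_1, \ldots, X_n)$ be i.i.d. r.v.s with $E(X_i) = \mu(\theta)$. In Chapter 5 we have seen that if $\mu^{-1}$ exists and is continuous then $\mu^{-1}(\bar X)$ is consistent for $\theta$. A sufficient condition for $\mu^{-1}$ to exist and be continuous and differentiable is that $d\mu/d\theta \ne 0$ and is continuous. Now suppose that $(X_1, \ldots, X_n)$ are i.i.d. such that $E(X_i) = \mu(\theta)$ and $\mathrm{Var}(X_i) = \sigma^2(\theta)$. Then by CLT $\sqrt{n}\,(\bar X - \mu(\theta)) \xrightarrow{d} N(0, \sigma^2(\theta))$, or $\bar X$ is CAN for $\mu(\theta)$ with asymptotic variance $\sigma^2(\theta)/n$. Now suppose that $d\mu/d\theta \ne 0$ and continuous. We can claim by Theorem 6.1.1 that $\mu^{-1}(\bar X)$ is CAN. Reparametrizing the problem by the transformation $\phi = \mu(\theta)$, so that $\theta = \mu^{-1}(\phi)$, we have

$$\bar X \sim AN\!\left(\phi,\ \frac{\sigma^2(\mu^{-1}(\phi))}{n}\right).$$

Since $\dfrac{d\phi}{d\theta}\cdot\dfrac{d\theta}{d\phi} = 1$ we have

$$\frac{d\theta}{d\phi} = \frac{d\mu^{-1}(\phi)}{d\phi} = \frac{1}{d\phi/d\theta} = \left(\frac{d\mu}{d\theta}\right)^{-1}.$$

For Theorem 6.1.1 to be applicable we need that $d\mu^{-1}(\phi)/d\phi$ exists and is continuous, or equivalently that $(d\mu/d\theta)^{-1}$ exists and is continuous. Then

$$\mu^{-1}(\bar X) \sim AN\!\left(\mu^{-1}(\phi),\ \frac{\sigma^2(\mu^{-1}(\phi))}{n}\left(\frac{d\mu^{-1}(\phi)}{d\phi}\right)^{2}\right)
\quad\text{or}\quad
\mu^{-1}(\bar X) \sim AN\!\left(\theta,\ \frac{\sigma^2(\theta)}{n}\left(\frac{d\mu}{d\theta}\right)^{-2}\right).$$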

We thus have the following theorem. 


THEOREM 6.2.1 Let $\{X_i\}_1^n$ be i.i.d. with $E(X_i) = \mu(\theta)$ and $\mathrm{Var}(X_i) = \sigma^2(\theta)$ such that $[d\mu/d\theta]^{-1}$ is non-vanishing and continuous. Then $\mu^{-1}(\bar X)$ is CAN for $\theta$ with



120 A First Course on Parametric Inference 


$$AV(\mu^{-1}(\bar X)) = \frac{\sigma^2(\theta)}{n}\left(\frac{d\mu}{d\theta}\right)^{-2}.$$


In a similar way if $\xi_p(\theta)$ denotes the $p$-th percentile of $X$ for $0 < p < 1$ then, assuming $f(\xi_p) > 0$,

$$X_{([np]+1)} \sim AN\!\left(\xi_p(\theta),\ \frac{p(1-p)}{n[f(\xi_p)]^2}\right),$$

or $X_{([np]+1)}$ is CAN for $\xi_p(\theta)$ (David, 1981). Again if $\xi_p(\theta)$ is such that $(d\xi_p/d\theta)^{-1}$ is non-vanishing and continuous then

$$\xi_p^{-1}(X_{([np]+1)}) \sim AN\!\left(\theta,\ \frac{p(1-p)}{n[f(\xi_p)]^2}\left(\frac{d\xi_p}{d\theta}\right)^{-2}\right).$$

Note that $\mu^{-1}(\bar X)$ and $\xi_p^{-1}(X_{([np]+1)})$ are moment and percentile estimators based on $X$. We could also consider transformed variables $Y_i = U(X_i)$ such that $dU/dx \ne 0$ and continuous. Then moment and percentile estimators based on $(Y_1, \ldots, Y_n)$ would also have similar properties if $\left(dE(Y)/d\theta\right)^{-1}$ and $\left(d\xi_p(Y)/d\theta\right)^{-1}$ are non-vanishing and continuous. We now consider some examples to illustrate the above technique of obtaining moment and percentile estimators.

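Example 6.2.1 Let $(X_1, \ldots, X_n)$ be i.i.d. with pdf $f(x, \theta) = \theta x^{\theta - 1}$, $0 < x < 1$, $\theta > 0$. Then $\mu(\theta) = \dfrac{\theta}{\theta + 1}$ and $\sigma^2(\theta) = \dfrac{\theta}{(\theta + 1)^2(\theta + 2)}$. Now $\dfrac{d\mu}{d\theta} = \dfrac{1}{(\theta + 1)^2}$ and $\left(\dfrac{d\mu}{d\theta}\right)^{-1} = (\theta + 1)^2 > 0$ and is continuous. Hence the moment estimator $\mu^{-1}(\bar X) = \dfrac{\bar X}{1 - \bar X}$ is the solution of the moment equation $\bar X = \dfrac{\theta}{\theta + 1}$ and

$$\frac{\bar X}{1 - \bar X} \sim AN\!\left(\theta,\ \frac{\theta(\theta + 1)^2}{n(\theta + 2)}\right).$$

As can be proved, $\{f(x, \theta),\ \theta > 0\}$ as given above is a one-parameter exponential family with $u(\theta) = (\theta - 1)$, $K(x) = \log x$, and if $Y_i = -\log X_i$ then $g(y, \theta) = \theta e^{-\theta y}$, $\theta > 0$, $y > 0$. Now $E(Y) = \dfrac{1}{\theta} = \mu_Y(\theta)$ and $\dfrac{d\mu_Y}{d\theta} = -\dfrac{1}{\theta^2}$ is such that $\left(\dfrac{d\mu_Y}{d\theta}\right)^{-1} = -\theta^2 < 0$ and is continuous. Hence the moment estimator based on $\log X_i$ is given by $\hat\theta = \dfrac{n}{-\sum \log X_i}$ and is CAN for $\theta$. Now $\mathrm{Var}(Y) = \dfrac{1}{\theta^2}$. Hence

$$AV(\hat\theta) = \frac{1}{n\theta^2}\,(\theta^2)^2 = \frac{\theta^2}{n}$$

and therefore $\hat\theta \sim AN(\theta, \theta^2/n)$. It can be easily checked that $AV\!\left(\dfrac{\bar X}{1 - \bar X}\right) > AV(\hat\theta)$ and $AV(\hat\theta) = $ CRLB for estimating $\theta$.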
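The comparison in Example 6.2.1 can be checked numerically. The following is a minimal simulation sketch, not part of the original text: the function names, sample size, number of replications and seed are illustrative assumptions, and sampling uses the inversion $X = U^{1/\theta}$ implied by the d.f. $F(x) = x^\theta$.

```python
import math
import random
import statistics


def draw_sample(theta, n, rng):
    """Draw n values from f(x, theta) = theta * x**(theta - 1) by inversion."""
    # 1 - rng.random() lies in (0, 1], so log(x) below stays finite.
    return [(1.0 - rng.random()) ** (1.0 / theta) for _ in range(n)]


def moment_estimate(sample):
    """Moment estimator based on X: xbar / (1 - xbar)."""
    xbar = statistics.fmean(sample)
    return xbar / (1.0 - xbar)


def log_moment_estimate(sample):
    """Moment estimator based on -log X: n / (-sum of log x_i)."""
    return len(sample) / -sum(math.log(x) for x in sample)


def av_moment(theta, n):
    """Asymptotic variance theta (theta + 1)^2 / (n (theta + 2))."""
    return theta * (theta + 1.0) ** 2 / (n * (theta + 2.0))


def av_log_moment(theta, n):
    """Asymptotic variance theta^2 / n, which equals the CRLB."""
    return theta ** 2 / n


def simulate_variances(theta=2.0, n=200, reps=1000, seed=1):
    """Empirical variances of both estimators over `reps` samples (illustrative settings)."""
    rng = random.Random(seed)
    est_x, est_log = [], []
    for _ in range(reps):
        sample = draw_sample(theta, n, rng)
        est_x.append(moment_estimate(sample))
        est_log.append(log_moment_estimate(sample))
    return statistics.variance(est_x), statistics.variance(est_log)


if __name__ == "__main__":
    v_x, v_log = simulate_variances()
    print("based on X     : empirical", v_x, "asymptotic", av_moment(2.0, 200))
    print("based on -log X: empirical", v_log, "asymptotic", av_log_moment(2.0, 200))
```

Under these assumptions the empirical variances should lie close to the two asymptotic values, with the estimator based on $-\log X$ showing the smaller variance.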

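Example 6.2.2 Let $(X_1, \ldots, X_n)$ be a random sample from the Weibull distribution with pdf $f(x, \alpha) = \alpha x^{\alpha - 1} e^{-x^\alpha}$, $x > 0$, $\alpha > 0$. Then as seen in the previous chapter, for $0 < p < 1$ we have $\log \xi_p(\alpha) = \dfrac{\log[-\log(1 - p)]}{\alpha} = c_p/\alpha$. Now

$$\left(\frac{d\xi_p}{d\alpha}\right)^{-1} = \frac{-\alpha^2}{c_p\,\xi_p(\alpha)}$$

is non-vanishing and continuous. Therefore Theorem 6.2.1 is applicable. As $X_{([np]+1)} \sim AN\!\left(\xi_p(\alpha),\ \dfrac{p(1 - p)}{n[f(\xi_p)]^2}\right)$ we have that the $p$-th percentile estimator

$$\hat\alpha = \frac{c_p}{\log[X_{([np]+1)}]}$$

is CAN for $\alpha$ with

$$AV(\hat\alpha) = \frac{p(1 - p)}{n[f(\xi_p)]^2}\cdot\frac{\alpha^4}{c_p^2\,\xi_p^2(\alpha)} = \frac{p(1 - p)\,\alpha^2}{n\,[c_p(1 - p)\log(1 - p)]^2},$$

where the last form uses $f(\xi_p)\,\xi_p = -\alpha(1 - p)\log(1 - p)$.

We again point out here that the method of moments is not applicable in this example as $E(X) = \Gamma\!\left(\dfrac{1}{\alpha} + 1\right) = \mu(\alpha)$ is such that the condition $\dfrac{d\mu}{d\alpha} \ne 0$ does not hold and the moment equation may have no solution or two solutions for several observed values of $\bar X$, as noted in Example 5.2.3.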
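The percentile estimator of Example 6.2.2 is easy to compute directly. The sketch below is illustrative only: the function names, the choice $p = 0.9$ and the seed are assumptions, and the sample is generated by inverting the Weibull d.f. $F(x) = 1 - e^{-x^\alpha}$.

```python
import math
import random


def weibull_sample(alpha, n, rng):
    """Draw n values from f(x, alpha) = alpha x^(alpha-1) exp(-x^alpha) by inversion."""
    return [(-math.log(1.0 - rng.random())) ** (1.0 / alpha) for _ in range(n)]


def c_p(p):
    """c_p = log[-log(1 - p)], so that log xi_p(alpha) = c_p / alpha."""
    return math.log(-math.log(1.0 - p))


def percentile_estimate(sample, p):
    """alpha_hat = c_p / log X_([np]+1); requires c_p != 0, i.e. p != 1 - 1/e."""
    ordered = sorted(sample)
    r = math.floor(len(ordered) * p) + 1  # 1-based rank r = [np] + 1
    return c_p(p) / math.log(ordered[r - 1])


def asymptotic_variance(alpha, p, n):
    """AV(alpha_hat) = p (1 - p) alpha^2 / (n [c_p (1 - p) log(1 - p)]^2)."""
    return p * (1.0 - p) * alpha ** 2 / (n * (c_p(p) * (1.0 - p) * math.log(1.0 - p)) ** 2)


if __name__ == "__main__":
    rng = random.Random(7)  # illustrative seed
    sample = weibull_sample(alpha=2.0, n=20000, rng=rng)
    print("alpha_hat =", percentile_estimate(sample, 0.9))
    print("asymptotic s.d. =", math.sqrt(asymptotic_variance(2.0, 0.9, 20000)))
```

The estimator is undefined when $c_p = 0$, which is why the sketch avoids $p = 1 - 1/e$.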

Example 6.2.3 Let $X_i = \theta + \varepsilon_i$, $i = 1, 2, \ldots, n$ be i.i.d. with error distribution $f(\varepsilon) = |\varepsilon|$, $-1 < \varepsilon < 1$. Here $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \tfrac12$. By CLT, $\bar X \sim AN\!\left(\theta, \dfrac{1}{2n}\right)$. As the error distribution is symmetric around zero, $\xi_{1/2} = \theta$ and $X_{([n/2]+1)}$ is consistent for $\theta$ as per results of Chapter 5. However, as the pdf of $X$ vanishes at $\xi_{1/2} = \theta$ the theorem quoted above is not applicable. On the other hand if we take $p = 3/4$ then $X_{([3n/4]+1)}$ is CAN for $\xi_{3/4} = \theta + \dfrac{1}{\sqrt2}$, and as $\dfrac{d\xi_{3/4}}{d\theta} = 1$ and $f(\xi_{3/4}) = \dfrac{1}{\sqrt2} > 0$, we have

$$X_{([3n/4]+1)} - \frac{1}{\sqrt2} \sim AN\!\left(\theta,\ \frac{3}{8n}\right).$$

Similarly we have $X_{([n/4]+1)} + \dfrac{1}{\sqrt2} \sim AN\!\left(\theta,\ \dfrac{3}{8n}\right)$. What about CANness of $T = \dfrac{X_{([3n/4]+1)} + X_{([n/4]+1)}}{2}$? We can study this if we know the joint asymptotic distribution of $X_{([np_1]+1)}$ and $X_{([np_2]+1)}$ for $0 < p_1 < p_2 < 1$. This we shall consider a little later. Before that we consider a generalization of Example 6.2.1 to the case of the one-parameter exponential family, and in the next section consider CANness of vector valued estimators for a vector valued parameter.

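Let $(X_1, \ldots, X_n)$ be i.i.d. such that $f(x, \theta) = \exp\{u(\theta)K(x) + v(\theta) + w(x)\}$, $x \in S$, $\theta \in \Omega \subset R^1$, which belongs to the one-parameter exponential family. Then as observed in Chapter 3, $K(x)$ is a complete sufficient statistic for the family. Now $\dfrac{\partial \log f}{\partial\theta} = u'K(x) + v'$ and $E\!\left(\dfrac{\partial \log f}{\partial\theta}\right) = 0$ implies that

$$E(K(x)) = \frac{-v'}{u'} = \psi(\theta) \quad\text{and}\quad \frac{d\psi}{d\theta} = \frac{u''v' - v''u'}{u'^2}.$$

Further,

$$\mathrm{Var}\!\left(\frac{\partial \log f}{\partial\theta}\right) = I(\theta) = E(u'K(x) + v')^2 = u'^2\,E\!\left(K(x) - (-v'/u')\right)^2 = u'^2\,[\mathrm{Var}\,K(x)].$$

Hence $\sigma_K^2(\theta) = \dfrac{I(\theta)}{u'^2}$. But

$$I(\theta) = E\!\left(\frac{-\partial^2 \log f}{\partial\theta^2}\right) = E(-u''K(x) - v'') = \frac{u''v' - v''u'}{u'} = u'\,\frac{d\psi}{d\theta}.$$

Hence

$$\frac{d\psi}{d\theta} = \frac{I(\theta)}{u'}.$$

Let $T = \dfrac1n \sum_{i=1}^n K(x_i)$. Then by CLT, $T \sim AN\!\left(\psi(\theta), \dfrac{\sigma_K^2(\theta)}{n}\right)$. Further $\left(\dfrac{d\psi}{d\theta}\right)^{-1}$ is continuous and non-zero. The conditions of Theorem 6.2.1 are satisfied. Hence $\hat\theta = \psi^{-1}(T)$ is asymptotically normal with mean $\theta$ and variance $\dfrac{\sigma_K^2(\theta)}{n}\left(\dfrac{d\psi}{d\theta}\right)^{-2} = \dfrac{I(\theta)}{n u'^2}\cdot\dfrac{u'^2}{I^2(\theta)}$, that is,

$$\psi^{-1}(T) \sim AN\!\left(\theta,\ \frac{1}{nI(\theta)}\right).$$

We thus have the theorem,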


THEOREM 6.2.2 If $(X_1, \ldots, X_n)$ is a random sample of size $n$ from a one-parameter exponential family then the moment estimator $\hat\theta$ based on the sufficient statistic is CAN for $\theta$ with $AV(\hat\theta) = \dfrac{1}{nI(\theta)}$, or $\hat\theta \sim AN\!\left(\theta,\ \dfrac{1}{nI(\theta)}\right)$.

We observe that $AV(\hat\theta) = $ CRLB for unbiased estimation of $\theta$. Thus even if we have an exponential family in which there does not exist an unbiased estimator of $\theta$ which attains the CRLB for finite $n$, the moment estimator based on the sufficient statistic is CAN and its asymptotic variance attains the CRLB. We illustrate this phenomenon by an example.

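Example 6.2.4 Let $(X_1, \ldots, X_n)$ be i.i.d. with $f(x, \theta) = \theta x^{\theta - 1}$, $0 < x < 1$, $\theta > 0$. Then as observed in Example 6.2.1, $\hat\theta = \dfrac{n}{-\sum \log X_i}$ is the moment estimator based on the sufficient statistic $K(x) = \log x$. Note that $\log f(x, \theta) = \log\theta + (\theta - 1)\log x$ and $\dfrac{\partial \log f}{\partial\theta} = \dfrac1\theta + \log x$ and $-\dfrac{\partial^2 \log f}{\partial\theta^2} = \dfrac{1}{\theta^2}$, so that $I(\theta) = \dfrac{1}{\theta^2}$ and the CRLB for estimating $\theta$ is $\dfrac{\theta^2}{n}$. Now $\dfrac{\partial \log L}{\partial\theta} = \dfrac{n}{\theta} + \sum \log X_i$ and therefore there does not exist an unbiased estimator of $\theta$ which attains the CRLB. But $AV(\hat\theta) = $ CRLB for estimating $\theta$. By using the RBLS theorem, the reader should verify that the MVUE of $\theta$ exists for $n > 1$ and its variance is larger than the CRLB for estimating $\theta$.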
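One possible route to this verification is sketched below; the symbols $T$ and $\tilde\theta$ are introduced here only for illustration. Since each $-\log X_i$ is exponential with mean $1/\theta$, $T = -\sum \log X_i$ has a gamma distribution with shape $n$ and rate $\theta$, and the standard gamma moments give:

```latex
\begin{aligned}
E\!\left(\frac{1}{T}\right) &= \frac{\theta}{n-1}\quad (n > 1), &
E\!\left(\frac{1}{T^{2}}\right) &= \frac{\theta^{2}}{(n-1)(n-2)}\quad (n > 2),\\[4pt]
\tilde\theta &= \frac{n-1}{T}, &
\mathrm{Var}(\tilde\theta) &= \frac{(n-1)\,\theta^{2}}{n-2} - \theta^{2}
  = \frac{\theta^{2}}{n-2} \;>\; \frac{\theta^{2}}{n} = \text{CRLB}.
\end{aligned}
```

Here $\tilde\theta$ is unbiased and a function of the complete sufficient statistic, so it is the MVUE; its variance is finite only for $n > 2$ and exceeds the CRLB, while the ratio of the two tends to one as $n \to \infty$.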

Exercise 6.2 (1) Let $(X_1, \ldots, X_n)$ be i.i.d. Pareto with pdf $f(x, \lambda) = \dfrac{\lambda}{x^{\lambda + 1}}$, $x > 1$, $\lambda > 0$. Show that the pdf belongs to the one-parameter exponential family and the moment estimator of $\lambda$ based on $K(x) = \log x$ has asymptotic variance equal to the CRLB. Obtain the MVUE of $\lambda$ and compare its variance with the CRLB.

(2) (a) Let $f(x, \theta) = \dfrac{1}{2\theta}\exp\{-|x|/\theta\}$, $x \in R_1$, $\theta > 0$. Show that the pdf belongs to the exponential family and the moment estimator of $\theta$ based on $K(x) = |x|$ is also the MVUE of $\theta$ whose variance attains the CRLB for each $n \ge 1$.

(b) Show that $\bar X$, the sample mean, is not consistent for $\theta$ though it is asymptotically normal.

(c) Using the method of moments based on $U(x) = x^2$ obtain a CAN estimator of $\theta$ and compare its asymptotic variance with that of $\dfrac1n \sum |X_i|$.

(3) (a) Variance Stabilizing Transformation was introduced by Bartlett (1937) to analyze data in which the variance is a function of the mean, for example as in the case of the exponential, binomial or Poisson distribution. Let $X_i$ be i.i.d. Poisson; then by CLT, $\bar X \sim AN\!\left(\lambda, \dfrac{\lambda}{n}\right)$. Bartlett suggested considering $\psi(\bar X)$ such that $\psi(\bar X) \sim AN\!\left(\psi(\lambda), \dfrac{c^2}{n}\right)$ where $c^2$ does not depend on $\lambda$. Now by Theorem 6.1.1, $AV(\psi(\bar X)) = \dfrac{\lambda}{n}\left(\dfrac{d\psi}{d\lambda}\right)^2$, so that $\dfrac{d\psi}{d\lambda} = \dfrac{c}{\sqrt\lambda}$, i.e. $\psi(\lambda) = 2c\sqrt\lambda$, which is continuous, and therefore (taking $c = \tfrac12$) $\sqrt{\bar X} \sim AN\!\left(\sqrt\lambda, \dfrac{1}{4n}\right)$. We will see later that such transformations are very useful in constructing large sample tests or confidence intervals for $\lambda$.

(b) Determine the variance stabilizing transformation for $(X_1, \ldots, X_n)$ i.i.d. $b(1, \theta)$, where $\bar X \sim AN\!\left(\theta, \dfrac{\theta(1 - \theta)}{n}\right)$.

(c) Carry out a similar exercise for $(X_1, \ldots, X_n)$ i.i.d. exponential with mean $\theta$, where $\bar X \sim AN\!\left(\theta, \dfrac{\theta^2}{n}\right)$.
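The Poisson case in part (3)(a) can be illustrated by simulation. The sketch below is an assumption-laden illustration rather than a prescribed method: the Poisson sampler is Knuth's multiplication method (adequate only for small $\lambda$), and the names, sample size, replication count and seed are hypothetical choices.

```python
import math
import random
import statistics


def poisson_draw(lam, rng):
    """One Poisson(lam) variate by Knuth's multiplication method (small lam only)."""
    limit = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def scaled_variances(lam, n=100, reps=1000, seed=3):
    """Return (n Var(xbar), n Var(sqrt(xbar))) estimated from `reps` samples of size n."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        means.append(statistics.fmean([poisson_draw(lam, rng) for _ in range(n)]))
    raw = n * statistics.variance(means)                            # close to lam
    root = n * statistics.variance([math.sqrt(m) for m in means])   # close to 1/4
    return raw, root


if __name__ == "__main__":
    for lam in (2.0, 8.0):
        raw, root = scaled_variances(lam)
        print("lambda =", lam, " n*Var(xbar) =", raw, " n*Var(sqrt(xbar)) =", root)
```

The first scaled variance grows with $\lambda$, while the second should stay near $1/4$ whatever the value of $\lambda$, which is the point of the transformation.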

6.3 CAN Estimators: Multiparameter Case 

Let $T = (T_1, \ldots, T_m)'$ be a vector valued estimator which is consistent for a vector parameter $\theta = (\theta_1, \ldots, \theta_m)'$ as defined in Chapter 5. If there exists a sequence $a_n > 0$ and $a_n \to \infty$ such that the vector valued r.v. $a_n(T - \theta)$ has in the limit a non-singular proper $m$-dimensional normal distribution then we say that $T$ is CAN for $\theta$. Thus if $a_n(T - \theta) \xrightarrow{d} N^{(m)}(0, \Lambda(\theta))$, where $\Lambda(\theta)$ is a symmetric positive definite matrix, then $T$ is said to have asymptotic normal distribution with mean vector $\theta$ and variance covariance matrix $\Lambda(\theta)/a_n^2$. This is denoted by $T \sim AN^{(m)}\!\left(\theta, \dfrac{\Lambda(\theta)}{a_n^2}\right)$. As in the case of $m = 1$, usually the choice of $a_n = \sqrt n$ will suffice.

One consequence of the above definition is that each $T_i$ is CAN for $\theta_i$ with $AV(T_i) = \dfrac{\lambda_{ii}(\theta)}{n}$ and any linear combination $T^* = \sum_{i=1}^m l_i T_i$ is CAN for $\sum_{i=1}^m l_i\theta_i$ with $AV(T^*) = \dfrac1n\, l'\Lambda(\theta)\,l$.

As the univariate CLT for i.i.d. r.v.s with finite variance is helpful in constructing CAN estimators for a real parameter $\theta$, in a similar way the multivariate CLT (MVCLT) for i.i.d. r.v.s with finite variance-covariances is useful to obtain CAN estimators for $\theta = (\theta_1, \ldots, \theta_m)'$. Let $Z = (Z_1, \ldots, Z_m)'$ be such that $E(Z) = \mu$ and $M_Z = \Lambda$, which we assume to be pd. Let $\{Z_i\}_1^n$ be i.i.d. r.v.s distributed as $Z$ and let $\bar Z = (\bar Z_1, \ldots, \bar Z_m)'$ be the mean vector of the sample, i.e. $\bar Z_r = \dfrac1n \sum_{i=1}^n Z_{ri}$, $r = 1, 2, \ldots, m$. Then MVCLT asserts that $\sqrt n\,(\bar Z - \mu) \xrightarrow{d} N^{(m)}(0, \Lambda)$ or $\bar Z \sim AN^{(m)}\!\left(\mu, \dfrac1n \Lambda\right)$. We now illustrate this technique by way of a few examples.

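Example 6.3.1 Let $(X_1, \ldots, X_n)$ be i.i.d. $N(\mu, \sigma^2)$. Then $E(X_1) = \mu$, $E(X_1^2) = \mu^2 + \sigma^2$. Let $Z = (Z_1, Z_2)'$ where $Z_1 = X_1$ and $Z_2 = X_1^2$. Then $\Lambda$ in general is given by

$$\Lambda = \begin{pmatrix} \mu_2' - \mu_1'^2 & \mu_3' - \mu_1'\mu_2' \\ \mu_3' - \mu_1'\mu_2' & \mu_4' - \mu_2'^2 \end{pmatrix}$$

and for the normal distribution

$$\Lambda = \begin{pmatrix} \sigma^2 & 2\mu\sigma^2 \\ 2\mu\sigma^2 & 2\sigma^4 + 4\sigma^2\mu^2 \end{pmatrix}.$$

Therefore by MVCLT, we have for samples from $N(\mu, \sigma^2)$,

$$\begin{pmatrix} m_1' \\ m_2' \end{pmatrix} \sim AN^{(2)}\!\left( \begin{pmatrix} \mu \\ \mu^2 + \sigma^2 \end{pmatrix},\ \frac1n \begin{pmatrix} \sigma^2 & 2\mu\sigma^2 \\ 2\mu\sigma^2 & 2\sigma^4 + 4\sigma^2\mu^2 \end{pmatrix} \right),$$

where $m_1'$ and $m_2'$ are the first two raw sample moments.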

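Example 6.3.2 Let the pmf of $(X, Y)$ be given by

$$f(x, y, \lambda, p) = \binom{x}{y} p^y (1 - p)^{x - y}\, \frac{e^{-\lambda}\lambda^x}{x!}, \quad 0 < p < 1,\ 0 < \lambda,\quad y = 0, 1, 2, \ldots, x,\ x = 0, 1, 2, \ldots.$$

Then $E(X) = \lambda$, $E(Y) = \lambda p$ and

$$M_Z = \begin{pmatrix} \lambda & \lambda p \\ \lambda p & \lambda p \end{pmatrix}.$$

This follows from the fact that $X \sim P(\lambda)$, $Y \sim P(\lambda p)$ and $E(XY) = E\{X E(Y \mid X)\} = E(X \cdot Xp) = E(X^2 p) = (\lambda + \lambda^2)p$. Therefore

$$\begin{pmatrix} \bar X \\ \bar Y \end{pmatrix} \sim AN^{(2)}\!\left( \begin{pmatrix} \lambda \\ \lambda p \end{pmatrix},\ \frac1n \begin{pmatrix} \lambda & \lambda p \\ \lambda p & \lambda p \end{pmatrix} \right).$$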



Example 6.3.3 Let $Z = (Z_1, \ldots, Z_k)'$ be a multinomial r.v. in $k$ cells with $f(z_1, \ldots, z_k) = p_1^{z_1} p_2^{z_2} \cdots p_k^{z_k}$, $z_i = 0$ or $1$, $\sum_{i=1}^k z_i = 1$ and $0 < p_i < 1$, $\sum p_i = 1$. Then $E(Z_i) = p_i$ and $V(Z_i) = \lambda_{ii} = p_i(1 - p_i)$ and $\mathrm{Cov}(Z_i, Z_j) = \lambda_{ij} = -p_i p_j$. Thus

$$M_Z = \begin{pmatrix} p_1(1 - p_1) & \cdots & -p_1 p_k \\ \vdots & \ddots & \vdots \\ -p_k p_1 & \cdots & p_k(1 - p_k) \end{pmatrix}.$$

However as $\sum z_i = 1$, $M_Z$ is singular and has rank $(k - 1)$. We therefore consider only $(k - 1)$ of the $z_i$'s, say $(z_1, \ldots, z_{k-1})$, so that $z_k = 1 - z_1 - \cdots - z_{k-1}$ and $p_k = 1 - p_1 - p_2 - \cdots - p_{k-1}$. Then by MVCLT as applied to $Z = (Z_1, \ldots, Z_{k-1})'$, the vector of $(k - 1)$ relative cell frequencies in $(k - 1)$ cells will have the asymptotic normal distribution with mean vector $(p_1, \ldots, p_{k-1})'$ and asymptotic variance covariance matrix $\dfrac1n \Lambda$ where $\lambda_{ii} = p_i(1 - p_i)$ and $\lambda_{ij} = -p_i p_j$, $i = 1, 2, \ldots, k - 1$, $j = 1, 2, \ldots, k - 1$.

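Similar to Theorem 6.1.1, which establishes invariance of the CAN property under continuous differentiable transformation, we can show that if $T \sim AN^{(m)}(\theta, \Lambda(\theta)/a_n^2)$ and if $\psi = (\psi_1, \ldots, \psi_k)'$ are such that the partial derivatives $\partial\psi_i/\partial\theta_j$ are continuous, then

$$\psi(T) \sim AN^{(k)}\!\left(\psi(\theta),\ \frac{G\Lambda G'}{a_n^2}\right), \quad\text{where}\quad G = \begin{pmatrix} \dfrac{\partial\psi_1}{\partial\theta_1} & \cdots & \dfrac{\partial\psi_1}{\partial\theta_m} \\ \vdots & & \vdots \\ \dfrac{\partial\psi_k}{\partial\theta_1} & \cdots & \dfrac{\partial\psi_k}{\partial\theta_m} \end{pmatrix},$$

$i = 1, 2, \ldots, k$ and $j = 1, 2, \ldots, m$, provided $G\Lambda G'$ is positive definite. The proof is similar to the proof of Theorem 6.1.1 and uses a Taylor series expansion of each component of the vector $\psi$. Thus $a_n(\psi(T) - \psi(\theta)) = G\,a_n(T - \theta) + a_n R$ where $a_n R \xrightarrow{p} 0$, and as $a_n(T - \theta) \xrightarrow{d} N^{(m)}(0, \Lambda)$, $G\,a_n(T - \theta) \xrightarrow{d} N^{(k)}(0, G\Lambda G')$. Hence we have

THEOREM 6.3.1 Let $T \sim AN^{(m)}(\theta, \Lambda(\theta)/a_n^2)$ and $\psi = (\psi_1, \ldots, \psi_k)'$ and $G = \left(\dfrac{\partial\psi_i}{\partial\theta_j}\right)$ be such that $G\Lambda G'$ is pd. Then $\psi(T) \sim AN^{(k)}(\psi(\theta), G\Lambda G'/a_n^2)$.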


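Example 6.3.4 In Example 6.3.1 we showed that $(m_1', m_2')'$ is $AN^{(2)}((\mu_1', \mu_2')', \Lambda/n)$. Note that here $(\theta_1, \theta_2)' = (\mu_1', \mu_2')' = (\mu, \mu^2 + \sigma^2)'$. Now suppose we want to obtain CAN estimators of $\mu$ and $\sigma^2$. Then we take $\psi_1 = \mu_1' = \mu$, $\psi_2 = \mu_2' - \mu_1'^2 = \sigma^2$, so that

$$G = \begin{pmatrix} 1 & 0 \\ -2\mu_1' & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -2\mu & 1 \end{pmatrix}.$$

Therefore $\psi_1(m_1', m_2') = m_1' = \bar X$ and $\psi_2(m_1', m_2') = m_2' - m_1'^2 = S^2/n$ = the second central moment of the sample, and we have

$$\begin{pmatrix} \bar X \\ S^2/n \end{pmatrix} \sim AN^{(2)}\!\left( \begin{pmatrix} \mu \\ \sigma^2 \end{pmatrix},\ \frac{G\Lambda G'}{n} \right).$$

By straightforward calculations we can show that

$$G\Lambda G' = \begin{pmatrix} \sigma^2 & 0 \\ 0 & 2\sigma^4 \end{pmatrix}.$$

If we use the general expression for $\Lambda$ given in Example 6.3.1, namely

$$\Lambda = \begin{pmatrix} \mu_2' - \mu_1'^2 & \mu_3' - \mu_1'\mu_2' \\ \mu_3' - \mu_1'\mu_2' & \mu_4' - \mu_2'^2 \end{pmatrix},$$

then

$$G\Lambda G' = \begin{pmatrix} \mu_2 & \mu_3 \\ \mu_3 & \mu_4 - \mu_2^2 \end{pmatrix},$$

where $\mu_2, \mu_3, \mu_4$ are central moments of the pdf $\{f(x, \theta),\ \theta \in \Omega\}$. Note that if the pdf is symmetric about the mean, or even if just $\mu_3 = 0$, we have $G\Lambda G' = \mathrm{diag}(\mu_2, \mu_4 - \mu_2^2)$ and $\bar X$ and $S^2/n$ would be asymptotically normal with off-diagonal elements zero, or equivalently $\bar X$ and $S^2/n$ would be asymptotically independent. Further observe that $(\bar X, S^2/n)'$ is a solution of the moment equations $m_1' = \mu$ and $m_2' = \mu^2 + \sigma^2$, and

$$\frac{\partial(\mu_1', \mu_2')}{\partial(\mu, \sigma^2)} = \begin{pmatrix} 1 & 0 \\ 2\mu & 1 \end{pmatrix} = G^{-1}.$$

In Section 6.4 we will consider the method of moments which generalizes these results as well as the method of percentiles.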
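The "straightforward calculations" for the normal case can be reproduced with a few lines of matrix arithmetic. The following sketch uses plain nested lists; the helper names and the numerical values of $\mu$ and $\sigma^2$ are illustrative assumptions.

```python
def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose of a matrix stored as nested lists."""
    return [list(row) for row in zip(*a)]


def g_lambda_gt_normal(mu, sigma2):
    """G Lambda G' for (m1', m2') -> (m1', m2' - m1'^2) under N(mu, sigma2)."""
    lam = [[sigma2, 2.0 * mu * sigma2],
           [2.0 * mu * sigma2, 2.0 * sigma2 ** 2 + 4.0 * sigma2 * mu ** 2]]
    g = [[1.0, 0.0],
         [-2.0 * mu, 1.0]]
    return matmul(matmul(g, lam), transpose(g))


if __name__ == "__main__":
    # Illustrative values; the result should be diag(sigma2, 2 * sigma2**2).
    print(g_lambda_gt_normal(mu=1.5, sigma2=2.0))
```

Whatever value of $\mu$ is supplied, the off-diagonal entries cancel, which is the numerical counterpart of the asymptotic independence of $\bar X$ and $S^2/n$ noted above.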


6.4 Method of Moments and Method of Percentiles 

Let $(X_1, \ldots, X_n)$ be a random sample of size $n$ with pdf belonging to $\{f(x, \theta),\ \theta \in \Omega \subset R^m\}$ and let $\{U_1(X), \ldots, U_m(X)\}$ be $m$ r.v.s such that $E(U_r(X)) = \psi_r(\theta)$, $r = 1, 2, \ldots, m$ and $\mathrm{Cov}(U_r(X), U_s(X)) = \lambda_{rs}(\theta)$, where $\Lambda(\theta)$ is pd. Then $\{U(X_i)\}_{i=1}^n$ are i.i.d. r.v.s distributed as $U(X) = (U_1(X), \ldots, U_m(X))'$ for which MVCLT holds, and the mean vector of the sample $\bar U = (\bar U_1, \ldots, \bar U_m)'$, where $\bar U_r = \dfrac1n \sum_{i=1}^n U_r(X_i)$, is CAN for $\psi = (\psi_1, \ldots, \psi_m)'$ with asymptotic variance covariance matrix $\dfrac1n \Lambda(\theta)$. The method of moments consists in equating $\bar U_r = \psi_r(\theta)$, $r = 1, 2, \ldots, m$ and solving these $m$ equations for $(\theta_1, \ldots, \theta_m)'$. The above moment equations admit a unique solution provided

$$\left| \frac{\partial(\psi_1, \ldots, \psi_m)}{\partial(\theta_1, \ldots, \theta_m)} \right| \ne 0$$

and then $\hat\theta_r = h_r(\bar U_1, \bar U_2, \ldots, \bar U_m)$ where $h = (h_1, \ldots, h_m)'$ is the inverse function of $\psi$. Let $\dfrac{\partial(\psi_1, \ldots, \psi_m)}{\partial(\theta_1, \ldots, \theta_m)} = G$; then $H = \dfrac{\partial(h_1, \ldots, h_m)}{\partial(\psi_1, \ldots, \psi_m)} = G^{-1}$. Note that since $G$ is assumed to be non-singular we can consider $\phi_r = \psi_r(\theta)$, $r = 1, 2, \ldots, m$ as a one-one transformation with unique inverse given by $\theta_r = h_r(\phi_1, \ldots, \phi_m)$ with Jacobian of transformation $H = G^{-1}$, or $GH = HG = I$, the identity matrix. By Theorem 6.3.1 we have $\hat\theta = (\hat\theta_1, \ldots, \hat\theta_m)' \sim AN^{(m)}(\theta, H\Lambda H'/n)$ or equivalently

$$\hat\theta \sim AN^{(m)}\!\left(\theta,\ G^{-1}\Lambda (G^{-1})'/n\right).$$

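We now consider application of the method of moments in the multiparameter exponential family. Let $(X_1, \ldots, X_n)$ be i.i.d. with pdf $\{f(x, \theta),\ \theta \in \Omega \subset R^m\}$ which belongs to the $m$-parameter exponential family, so that

$$\log f(x, \theta) = \sum_{r=1}^m u_r(\theta) K_r(x) + v(\theta) + w(x). \qquad (6.4.1)$$

Since $(K_1(x), \ldots, K_m(x))$ is minimal sufficient for $\theta$, we consider the method of moments based on $(K_1(x), \ldots, K_m(x))'$. Now assuming $\left| \dfrac{\partial(u_1, \ldots, u_m)}{\partial(\theta_1, \ldots, \theta_m)} \right| \ne 0$, we consider first a parametric transformation $\phi_r = u_r(\theta)$, $r = 1, 2, \ldots, m$, which is 1:1 and determines the unique inverse transformation $\theta_r = q_r(\phi_1, \ldots, \phi_m)$, $r = 1, 2, \ldots, m$. Then in the new parametric form we have

$$\log f(x, q(\phi)) = \sum_{r=1}^m \phi_r K_r(x) + v_1(\phi) + w(x)$$

where $v_1(\phi) = v(q(\phi))$ and $q(\phi) = (q_1(\phi), \ldots, q_m(\phi))'$. Now

$$\frac{\partial \log f}{\partial\phi_r} = K_r(x) + \frac{\partial v_1}{\partial\phi_r}, \qquad \frac{\partial^2 \log f}{\partial\phi_r\,\partial\phi_s} = \frac{\partial^2 v_1}{\partial\phi_r\,\partial\phi_s},$$

and as $E\!\left(\dfrac{\partial \log f}{\partial\phi_r}\right) = 0$, $r = 1, 2, \ldots, m$, we have $E(K_r(x)) = -\dfrac{\partial v_1}{\partial\phi_r} = \eta_r(\phi)$. Applying the method of moments to $(K_1(x), \ldots, K_m(x))'$, the moment equations are

$$\frac1n \sum_{i=1}^n K_r(x_i) = \bar Z_r = \eta_r(\phi), \quad r = 1, 2, \ldots, m. \qquad (6.4.2)$$

The system of equations given by (6.4.2) will give the estimator $\hat\phi = (\hat\phi_1, \ldots, \hat\phi_m)'$ if $\left| \dfrac{\partial(\eta_1, \ldots, \eta_m)}{\partial(\phi_1, \ldots, \phi_m)} \right| \ne 0$. Now

$$\frac{\partial\eta_r}{\partial\phi_s} = -\frac{\partial^2 v_1}{\partial\phi_r\,\partial\phi_s} = J_{rs}(\phi),$$

where $J_{rs}(\phi)$ denotes the $(r, s)$-th element of the Fisher information matrix $J(\phi)$. Thus $\left| \dfrac{\partial(\eta_1, \ldots, \eta_m)}{\partial(\phi_1, \ldots, \phi_m)} \right| = |J(\phi)| \ne 0$, which implies that $\hat\phi$ is uniquely determined. Now

$$E\!\left(\frac{-\partial^2 \log f}{\partial\phi_r\,\partial\phi_s}\right) = E\!\left(\frac{\partial \log f}{\partial\phi_r}\cdot\frac{\partial \log f}{\partial\phi_s}\right) = E[(K_r(x) - \eta_r)(K_s(x) - \eta_s)] = \mathrm{Cov}(K_r(x), K_s(x)).$$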

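Thus $(K_1(x_i), \ldots, K_m(x_i))'$, $i = 1, \ldots, n$, are i.i.d. r.v.s with mean vector $\eta(\phi)$ and variance covariance matrix $J(\phi)$, the Fisher information matrix of $\phi$. By MVCLT we have

$$\bar Z = (\bar Z_1, \ldots, \bar Z_m)' \sim AN^{(m)}(\eta(\phi), J(\phi)/n)$$

and $\bar Z$ is CAN for $\eta(\phi)$. Therefore $\hat\phi = \eta^{-1}(\bar Z)$ is CAN for $\phi$ and its asymptotic variance covariance matrix is given by $\dfrac1n G^{-1} J (G^{-1})'$ where $G = \dfrac{\partial(\eta_1, \ldots, \eta_m)}{\partial(\phi_1, \ldots, \phi_m)} = J(\phi)$. Therefore

$$\hat\phi \sim AN^{(m)}\!\left(\phi,\ \frac{J^{-1}(\phi)}{n}\right).$$

Now $\theta_r = q_r(\phi_1, \ldots, \phi_m)$, $r = 1, 2, \ldots, m$. Let $Q$ denote the matrix with $(r, i)$-th element $\partial\theta_r/\partial\phi_i$, so that

$$Q = \frac{\partial(\theta_1, \ldots, \theta_m)}{\partial(\phi_1, \ldots, \phi_m)} = \left[\frac{\partial(u_1, \ldots, u_m)}{\partial(\theta_1, \ldots, \theta_m)}\right]^{-1}.$$

Again using Theorem 6.3.1 we have that

$$\hat\theta = (\hat\theta_1, \ldots, \hat\theta_m)' = (q_1(\hat\phi), \ldots, q_m(\hat\phi))'$$

is asymptotically normal with mean vector $\theta$ and variance covariance matrix $\dfrac1n\, Q J^{-1}(\phi) Q'$ when evaluated in terms of $\theta$ by using $\phi_r = u_r(\theta)$, $r = 1, 2, \ldots, m$. This can be shown to be equal to $\dfrac1n J^{-1}(\theta)$.

Now $J_{rs}(\theta) = E\!\left(\dfrac{\partial \log f}{\partial\theta_r}\cdot\dfrac{\partial \log f}{\partial\theta_s}\right)$. But

$$\frac{\partial \log f}{\partial\theta_r} = \sum_{i=1}^m \frac{\partial \log f}{\partial\phi_i}\cdot\frac{\partial\phi_i}{\partial\theta_r}.$$

Hence

$$J_{rs}(\theta) = \sum_{i}\sum_{j} J_{ij}(\phi)\,\frac{\partial\phi_i}{\partial\theta_r}\,\frac{\partial\phi_j}{\partial\theta_s}.$$

Therefore $J(\theta) = (Q^{-1})' J(\phi)\, Q^{-1}$, or $Q J^{-1}(\phi) Q' = J^{-1}(\theta)$. Therefore $\hat\theta \sim AN^{(m)}\!\left(\theta,\ \dfrac{J^{-1}(\theta)}{n}\right)$. Hence, we have the theorem,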

THEOREM 6.4.1 If $\{f(x, \theta),\ \theta \in \Omega \subset R^m\}$ is an $m$-parameter exponential family then the moment estimator $\hat\theta$ of $\theta$ based on the minimal sufficient statistic is CAN for $\theta$ with

$$\hat\theta \sim AN^{(m)}(\theta,\ J^{-1}(\theta)/n)$$

where $J(\theta)$ denotes the Fisher information matrix.

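Example 6.4.1 Let $(X_1, \ldots, X_n)$ be i.i.d. $N(\theta_1, \theta_2)$. Then

$$\log f(x, \theta_1, \theta_2) = -\frac12 \log 2\pi - \frac12 \log\theta_2 - \frac{(x - \theta_1)^2}{2\theta_2}$$

and as seen earlier the pdf of $X$ belongs to the 2-parameter exponential family with $K_1(x) = x$, $K_2(x) = x^2$. Thus moment estimators based on minimal sufficient statistics are solutions of the moment equations $m_1' = \bar x = \theta_1$ and $m_2' = \dfrac1n \sum x_i^2 = \theta_2 + \theta_1^2$. The solutions are $\hat\theta_1 = \bar x$ and $\hat\theta_2 = \dfrac1n \sum x_i^2 - \bar x^2$. Then

$$(\hat\theta_1, \hat\theta_2)' \sim AN^{(2)}\!\left( (\theta_1, \theta_2)',\ \begin{pmatrix} \theta_2/n & 0 \\ 0 & 2\theta_2^2/n \end{pmatrix} \right).$$

Earlier this result was obtained in Example 6.3.4 using different techniques.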


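Example 6.4.2 Let $f(x, y, \lambda, p) = \dbinom{x}{y} p^y (1 - p)^{x - y}\, \dfrac{e^{-\lambda}\lambda^x}{x!}$, $0 < p < 1$, $\lambda > 0$, $y = 0, 1, 2, \ldots, x$, $x = 0, 1, 2, \ldots$. We have already seen that the pdf belongs to the two-parameter exponential family with

$$(u_1, u_2) = \left( \log\frac{p}{1 - p},\ \log(\lambda(1 - p)) \right) \quad\text{with}\quad K_1 = y,\ K_2 = x.$$

Therefore moment estimators are the solution of the moment equations $\bar x = \lambda$, $\bar y = \lambda p$, or $\hat\lambda = \bar x$, $\hat p = \bar y/\bar x$, and $(\hat\lambda, \hat p)'$ is CAN with mean vector $(\lambda, p)'$ and asymptotic variance covariance matrix $\mathrm{diag}\!\left(\dfrac{\lambda}{n},\ \dfrac{p(1 - p)}{n\lambda}\right)$. Note that if $\bar x = 0$ then $\bar y = 0$ and $\hat p = \bar y/\bar x$ is technically undefined. In this case the moment equations are $\lambda = 0$ and $\lambda p = 0$, which do not admit a unique solution for $p$, and $\lambda = 0$ is a solution outside the parameter space $\Omega = \{(\lambda, p)' \mid 0 < \lambda,\ 0 < p < 1\}$. However $P(\bar X = 0) = e^{-n\lambda} \to 0$ as $n \to \infty$ for any $\lambda > 0$, and such samples would rarely occur in practice if $n$ is sufficiently large.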
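The diagonal form of the asymptotic variance covariance matrix in Example 6.4.2 can also be recovered from the general formula $G^{-1}\Lambda(G^{-1})'$ of this section, taking $\Lambda$ from Example 6.3.2 and $\psi(\lambda, p) = (\lambda, \lambda p)'$. The sketch below does this numerically; the helper names and the parameter values are illustrative assumptions.

```python
def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inverse_2x2(m):
    """Inverse of a non-singular 2 x 2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if det == 0.0:
        raise ValueError("matrix is singular")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def moment_estimator_cov(lam, p):
    """n times the asymptotic covariance of (lambda_hat, p_hat): G^{-1} Lambda (G^{-1})'."""
    big_lambda = [[lam, lam * p],
                  [lam * p, lam * p]]          # covariance of (X, Y), Example 6.3.2
    g = [[1.0, 0.0],
         [p, lam]]                             # d(lam, lam * p) / d(lam, p)
    g_inv = inverse_2x2(g)
    g_inv_t = [list(row) for row in zip(*g_inv)]
    return matmul(matmul(g_inv, big_lambda), g_inv_t)


if __name__ == "__main__":
    # Illustrative values; compare with diag(lam, p * (1 - p) / lam).
    print(moment_estimator_cov(lam=3.0, p=0.25))
```

The requirement $\lambda \ne 0$ for $G$ to be invertible mirrors the remark above about samples with $\bar x = 0$.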

Example 6.4.3 Consider the multinomial distribution in $k$ cells as given in Example 6.3.3. One can show that the pmf of $(Z_1, \ldots, Z_{k-1})'$ belongs to the $(k - 1)$-parameter exponential family with $K_i(z) = z_i$, $i = 1, 2, \ldots, k - 1$, so that the moment equations give $\hat p_i = \bar Z_i$ = relative frequency in the $i$-th cell, $i = 1, 2, \ldots, k - 1$. The Fisher information matrix is given by $J(p) = (J_{rs}(p))$ such that for $r = 1, 2, \ldots, k - 1$, $s = 1, 2, \ldots, k - 1$,

$$J_{rr}(p) = \frac{1}{p_r} + \frac{1}{p_k}, \qquad J_{rs}(p) = \frac{1}{p_k} \quad (r \ne s),$$

where $p_k = 1 - p_1 - p_2 - \cdots - p_{k-1}$. Now we can show that $J^{-1}(p) = (J^{rs})$ is such that $J^{rr}(p) = p_r(1 - p_r)$ and $J^{rs}(p) = -p_r p_s$, so that we have the same result as in Example 6.3.3.
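The claim that the matrix with entries $p_r(1 - p_r)$ and $-p_r p_s$ is the inverse of $J(p)$ can be checked numerically by multiplying the two matrices. The sketch below does so for an arbitrary probability vector; the function names and the example probabilities are illustrative assumptions.

```python
def fisher_information(p):
    """J(p) for the first k-1 cells: J_rr = 1/p_r + 1/p_k, J_rs = 1/p_k."""
    k = len(p)
    return [[(1.0 / p[r] if r == s else 0.0) + 1.0 / p[k - 1]
             for s in range(k - 1)] for r in range(k - 1)]


def multinomial_covariance(p):
    """Lambda for the first k-1 cells: p_r (1 - p_r) on the diagonal, -p_r p_s elsewhere."""
    k = len(p)
    return [[p[r] * (1.0 - p[r]) if r == s else -p[r] * p[s]
             for s in range(k - 1)] for r in range(k - 1)]


def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


if __name__ == "__main__":
    probs = [0.1, 0.2, 0.3, 0.4]  # illustrative cell probabilities, k = 4
    product = matmul(fisher_information(probs), multinomial_covariance(probs))
    for row in product:
        print([round(v, 12) for v in row])  # rows of the identity matrix
```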

We now consider the percentile method to generate CAN estimators for 
vector valued parameters. For this purpose we use the following theorem. 

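THEOREM 6.4.2 Let $(X_1, \ldots, X_n)$ be a random sample of size $n$ from $\{f(x, \theta),\ \theta \in \Omega\}$. Let $0 < p_1 < p_2 < \cdots < p_m < 1$ and $r_i = [np_i] + 1$. Let $(\xi_{p_1}(\theta), \ldots, \xi_{p_m}(\theta))'$ be the $m$ percentiles and $(X_{(r_1)}, \ldots, X_{(r_m)})'$ the corresponding sample percentiles. Then $(X_{(r_1)}, \ldots, X_{(r_m)})'$ is asymptotically normal with mean vector $(\xi_{p_1}(\theta), \ldots, \xi_{p_m}(\theta))'$ and variance covariance matrix $\Lambda/n$ with

$$AV(X_{(r_i)}) = \frac{p_i(1 - p_i)}{n f^2(\xi_{p_i})}, \qquad ACV(X_{(r_i)}, X_{(r_j)}) = \frac{p_i(1 - p_j)}{n f(\xi_{p_i}) f(\xi_{p_j})}, \quad i < j,$$

provided $f(\xi_{p_i}) \ne 0$, $i = 1, 2, \ldots, m$ (David, 1981).

The percentile method consists of setting up the equations $X_{(r_i)} = \xi_{p_i}(\theta)$, $i = 1, 2, \ldots, m$ and then solving these for $\theta = (\theta_1, \ldots, \theta_m)'$. A sufficient condition for the unique solution is $|G| = \left| \dfrac{\partial(\xi_{p_1}, \ldots, \xi_{p_m})}{\partial(\theta_1, \ldots, \theta_m)} \right| \ne 0$. Under this condition the percentile estimator $\hat\theta = (\hat\theta_1, \ldots, \hat\theta_m)'$ is given by $\hat\theta_s = h_s(X_{(r_1)}, \ldots, X_{(r_m)})$, $s = 1, 2, \ldots, m$ and

$$\hat\theta = (\hat\theta_1, \ldots, \hat\theta_m)' \sim AN^{(m)}\!\left(\theta,\ G^{-1}\Lambda (G^{-1})'/n\right).$$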

We now consider a few examples of applications of percentile method in 
the multiparameter case. 

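Example 6.4.4 Let $(X_1, \ldots, X_n)$ be a random sample of size $n$ from the Cauchy distribution with location $\mu$ and scale $\sigma$, so that the pdf is given by

$$f(x, \mu, \sigma) = \frac{1}{\pi\sigma}\cdot\frac{1}{1 + ((x - \mu)/\sigma)^2} = \frac1\pi\cdot\frac{\sigma}{\sigma^2 + (x - \mu)^2}, \quad x \in R_1,\ \mu \in R_1,\ \sigma > 0.$$

Now $\xi_p = \mu + c_p\sigma$, where $\dfrac1\pi \tan^{-1}(c_p) + \dfrac12 = p$ or $c_p = \tan\!\left(\pi(p - \tfrac12)\right)$ for $0 < p < 1$.

Now as there are two parameters we consider $0 < p_1 < p_2 < 1$ and solve the percentile equations $X_{(r_1)} = \xi_{p_1} = \mu + c_{p_1}\sigma$ and $X_{(r_2)} = \xi_{p_2} = \mu + c_{p_2}\sigma$, giving CAN estimators

$$\hat\sigma = \frac{X_{(r_2)} - X_{(r_1)}}{c_{p_2} - c_{p_1}} \quad\text{and}\quad \hat\mu = \frac{c_{p_2} X_{(r_1)} - c_{p_1} X_{(r_2)}}{c_{p_2} - c_{p_1}},$$

which are linear functions of components of the order statistic. Now the asymptotic variance covariance matrix of $(X_{(r_1)}, X_{(r_2)})$ is given by

$$AV(X_{(r_i)}) = \frac{p_i(1 - p_i)}{n[f(\xi_{p_i})]^2} = \frac{\pi^2\sigma^2}{n}(1 + c_{p_i}^2)^2\, p_i(1 - p_i), \quad i = 1, 2,$$

and

$$ACV(X_{(r_1)}, X_{(r_2)}) = \frac{\pi^2\sigma^2}{n}\, p_1(1 - p_2)(1 + c_{p_1}^2)(1 + c_{p_2}^2).$$

Many choices are possible for $0 < p_1 < p_2 < 1$. Using the fact that the pdf is symmetric about $\mu$, the median, one can take $p_1 = 1/4$ and $p_2 = 3/4$, giving $c_{p_1} = -1$ and $c_{p_2} = +1$, so that

$$\hat\mu_1 = \frac{X_{(r_1)} + X_{(r_2)}}{2} \quad\text{and}\quad \hat\sigma_1 = \frac{X_{(r_2)} - X_{(r_1)}}{2}.$$

But $AV(X_{(r_i)}) = \dfrac{3\pi^2\sigma^2}{4n}$, $i = 1, 2$ and $ACV(X_{(r_1)}, X_{(r_2)}) = \dfrac{\pi^2\sigma^2}{4n}$. Therefore $AV(\hat\mu_1) = \dfrac{\pi^2\sigma^2}{2n}$, $AV(\hat\sigma_1) = \dfrac{\pi^2\sigma^2}{4n}$ and $ACV(\hat\mu_1, \hat\sigma_1) = 0$.

On the other hand, noting that $\mu$ itself is the median we can take $p_1 = 1/2$ and $p_2 = 3/4$ so that $c_{p_1} = 0$, $c_{p_2} = +1$. The percentile estimators are now $\hat\mu_2 = X_{([n/2]+1)}$ = the sample median and $\hat\sigma_2 = X_{(r_2)} - \hat\mu_2$. Again using Theorem 6.4.2 we have that $(\hat\mu_2, \hat\sigma_2)'$ is CAN for $(\mu, \sigma)'$ with

$$AV(\hat\mu_2) = \frac{\pi^2\sigma^2}{4n}, \qquad AV(\hat\sigma_2) = \frac{\pi^2\sigma^2}{2n}, \qquad ACV(\hat\mu_2, \hat\sigma_2) = 0.$$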
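The two choices of $(p_1, p_2)$ in Example 6.4.4 can be compared by evaluating the asymptotic variances from Theorem 6.4.2 for the linear estimators $\hat\mu$ and $\hat\sigma$. The sketch below returns the quantities multiplied by $n/\sigma^2$; the function names are illustrative assumptions.

```python
import math


def c(p):
    """c_p = tan(pi (p - 1/2)), so that xi_p = mu + c_p * sigma for the Cauchy pdf."""
    return math.tan(math.pi * (p - 0.5))


def scaled_asymptotic_moments(p1, p2):
    """(n/sigma^2) * (AV(mu_hat), AV(sigma_hat), ACV(mu_hat, sigma_hat)) for 0 < p1 < p2 < 1."""
    c1, c2 = c(p1), c(p2)
    k = math.pi ** 2
    v11 = k * p1 * (1.0 - p1) * (1.0 + c1 * c1) ** 2
    v22 = k * p2 * (1.0 - p2) * (1.0 + c2 * c2) ** 2
    v12 = k * p1 * (1.0 - p2) * (1.0 + c1 * c1) * (1.0 + c2 * c2)
    d = c2 - c1
    a1, a2 = c2 / d, -c1 / d        # mu_hat    = a1 X_(r1) + a2 X_(r2)
    b1, b2 = -1.0 / d, 1.0 / d      # sigma_hat = b1 X_(r1) + b2 X_(r2)
    av_mu = a1 * a1 * v11 + 2.0 * a1 * a2 * v12 + a2 * a2 * v22
    av_sigma = b1 * b1 * v11 + 2.0 * b1 * b2 * v12 + b2 * b2 * v22
    acv = a1 * b1 * v11 + (a1 * b2 + a2 * b1) * v12 + a2 * b2 * v22
    return av_mu, av_sigma, acv


if __name__ == "__main__":
    print("quartiles          :", scaled_asymptotic_moments(0.25, 0.75))
    print("median and quartile:", scaled_asymptotic_moments(0.50, 0.75))
```

In units of $\sigma^2/n$, the quartile pair gives $(\pi^2/2, \pi^2/4, 0)$ and the median and upper quartile pair gives $(\pi^2/4, \pi^2/2, 0)$, so neither choice dominates for both parameters.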

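Example 6.4.5 We consider a generalization of Example 5.4.3 and Example 6.2.3. Let $f(x, \mu, \sigma) = \dfrac{|x - \mu|}{\sigma^2}$ for $\mu - \sigma < x < \mu + \sigma$. Let $0 < p_1 < 1/2 < p_2 < 1$ be such that $p_1 = \tfrac12 - \delta$ and $p_2 = \tfrac12 + \delta$. Here $0 < \delta < 1/2$. Then $\xi_{p_1} = \mu - c\sigma$ and $\xi_{p_2} = \mu + c\sigma$, where $c^2 = 2\delta$. The percentile estimators are given by

$$\hat\mu = \frac{X_{(r_1)} + X_{(r_2)}}{2} \quad\text{and}\quad \hat\sigma = \frac{X_{(r_2)} - X_{(r_1)}}{2c}.$$

Now $AV(X_{(r_1)}) = AV(X_{(r_2)}) = \dfrac{\sigma^2}{nc^2}\left(\dfrac14 - \delta^2\right)$ and $ACV(X_{(r_1)}, X_{(r_2)}) = \dfrac{\sigma^2}{nc^2}\left(\dfrac12 - \delta\right)^2$. This leads to

$$AV(\hat\mu) = \frac{\sigma^2}{2nc^2}\left(\frac12 - \delta\right) = \frac{(1 - 2\delta)\sigma^2}{8\delta n} = AV(\hat\sigma) \quad\text{and}\quad ACV(\hat\mu, \hat\sigma) = 0.$$

Observe that as $f(\mu) = 0$, the pdf vanishes at $\xi_{1/2}$ and Theorem 6.4.2 is not applicable. Further, as observed in Chapter 5 and in Example 6.2.3, $X_{(1)}$ and $X_{(n)}$ would be consistent for $\mu - \sigma$ and $\mu + \sigma$ respectively, and $T_1 = \dfrac{X_{(1)} + X_{(n)}}{2}$ and $T_2 = \dfrac{X_{(n)} - X_{(1)}}{2}$ would be consistent for $\mu$ and $\sigma$ respectively. However one can show that the asymptotic distribution of $(T_1, T_2)'$ is not normal as that of $(X_{(1)}, X_{(n)})'$ is not normal. We refer an enterprising reader to David (1981), which gives the asymptotic distribution theory of extreme order statistics in detail.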

Exercise 6.4 (1) (a) Let $(X_1, \ldots, X_n)$ be i.i.d. with pdf $f(x, \mu, \sigma) = \dfrac{1}{2\sigma}\exp\{-|x - \mu|/\sigma\}$, $x \in R_1$, $\mu \in R_1$, $\sigma > 0$. Obtain the moment estimator of $(\mu, \sigma)'$ and its asymptotic variance covariance matrix.

(b) Let $0 < p_1 < \tfrac12 < p_2 < 1$ and $p_1 = \tfrac12 - \delta$, $p_2 = \tfrac12 + \delta$. Show that $\xi_{p_1} = \mu - c\sigma$ and $\xi_{p_2} = \mu + c\sigma$ and the percentile estimators are $\dfrac{X_{(r_1)} + X_{(r_2)}}{2} = \hat\mu$ and $\dfrac{X_{(r_2)} - X_{(r_1)}}{2c} = \hat\sigma$. Obtain the asymptotic variance covariance matrix of $(\hat\mu, \hat\sigma)'$.

(c) Noting that $\xi_{1/2} = \mu$ we have $\hat\mu_1 = X_{([n/2]+1)}$. Obtain the asymptotic variance covariance matrix of $\hat\mu_1$ and $\hat\sigma_1 = \dfrac{X_{(r_2)} - X_{([n/2]+1)}}{c}$.

(2) (a) Let $(X_1, \ldots, X_n)'$ be a random sample of size $n$ from the pdf

$$f(x, \mu, \sigma) = \frac1\sigma \exp\left\{-\frac{x - \mu}{\sigma}\right\}, \quad x > \mu,\ \sigma > 0.$$

Let $(X_{(1)}, \ldots, X_{(n)})$ be the order statistic and define $Y_i = (n - i + 1)(X_{(i)} - X_{(i-1)})$, $i = 1, 2, \ldots, n$ with $X_{(0)} = \mu$. Then we have seen that $(Y_1, \ldots, Y_n)$ are i.i.d. exponential with mean $\sigma$. Let $T_1 = X_{(1)}$ and $T_2 = \sum_{i=2}^n (X_{(i)} - X_{(1)}) = \sum_{i=2}^n Y_i$. Then $T_1$ and $T_2$ are independent. Show that $T_2/(n - 1)$ is CAN for $\sigma$ and $T_1$ is consistent for $\mu$ but not CAN. Show that $n(T_1 - \mu) \xrightarrow{d} Y_1$ and $\dfrac{T_2}{n - 1} \sim AN\!\left(\sigma, \dfrac{\sigma^2}{n - 1}\right)$. This is a mixed situation in which one component of the vector statistic is CAN but the other is not.

(b) Obtain the moment estimator of $(\mu, \sigma)'$ and its asymptotic variance covariance matrix.

(3) (a) Let $X = (X_1, \ldots, X_m)'$ be an $m$-variate normal r.v. with mean vector $(\mu_1, \ldots, \mu_m)'$ and variance covariance matrix $\sigma^2 I_m$ where $I_m$ is the $(m \times m)$ identity matrix. Show that the pdf of $X$ belongs to the $(m + 1)$-parameter exponential family. Use the result on the method of moments as applied to a r.s. of size $n$ on the vector $X$ to obtain a CAN estimator of $(\mu_1, \ldots, \mu_m, \sigma^2)'$ and its asymptotic variance covariance matrix. Obtain CAN estimators of $\sum l_i\mu_i$ and also that of $(\sum l_i\mu_i)/\sigma$. Obtain their asymptotic variances.

(b) Use the above result to obtain a CAN estimator of the dose effect $d$ when $m = 2$. This corresponds to the situation in clinical trials where $X_1$ is the response to placebo and $X_2$ is the response to treatment and $d$ is the treatment effect, and we have two independent normal samples of size $n$ each.

(c) If in (b) above we have paired data, i.e. $X_{1i}$ is the initial measurement and $X_{2i}$ is the measurement on the same individual after the treatment, then the assumption of independence of $(X_1, X_2)$ is not realistic and we instead consider $Y_i = (X_{2i} - X_{1i})$, $i = 1, 2, \ldots, n$, the difference between the measurements before and after the treatment on $n$ patients. Then $(Y_1, \ldots, Y_n)'$ can be regarded as a random sample of size $n$ from $N(d, \sigma^2)$ where $d$ measures the treatment effect. Obtain a CAN estimator of $d/\sigma$ and its asymptotic variance.


6.5 Gauss-Legendre-Boscovich Revisited 

Chapter 1 outlined how the statistical methods were originally proposed in the context of direct and indirect measurements of “magnitudes of interest” arising in connection with problems in astronomy and geodesy.

First consider the problem of direct measurement where the simplest model is $X_i = \theta + \varepsilon_i$, where the $\varepsilon_i$ are i.i.d. with $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$. Then the Least Square Estimator (LSE) of $\theta$ is $\bar X$ and by CLT we have $\bar X \sim AN(\theta, \sigma^2/n)$. Thus $\bar X$ is CAN for $\theta$. We also have that $\hat\sigma^2 = \dfrac{\sum (X_i - \bar X)^2}{n} = \dfrac{S^2}{n}$ is consistent for $\sigma^2$, and $\hat\sigma^2$ will be CAN for $\sigma^2$ provided $E(X - \theta)^4 < \infty$. As seen in Example 6.3.1, $(\bar X, S^2/n)'$ is CAN for $(\theta, \sigma^2)'$.

The above result can be easily extended to the case $\theta = (\theta_1, \ldots, \theta_m)'$ where the model is $X_{ij} = \theta_i + \varepsilon_{ij}$, $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$, where $\{\varepsilon_{ij}\}$ are i.i.d. for fixed $i$ and $j = 1, 2, \ldots, n$ with covariance matrix $\Lambda$. Then the LSE of $\theta$ is the vector of sample means $\bar X$ and by MVCLT $\bar X \sim AN^{(m)}(\theta, \Lambda/n)$. The CAN estimator of $\Lambda$ can also be obtained and is given by $(S_{ij}/n)$ where $S_{ij} = \sum_{r=1}^n (X_{ir} - \bar X_i)(X_{jr} - \bar X_j)$. The CANness of $S_{ij}/n$ can be proved assuming that fourth order moments exist.

Next we consider the simplest indirect measurement problem where the model is $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $i = 1, 2, \ldots, n$ where the $\varepsilon_i$ are i.i.d. with $E(\varepsilon_i) = 0$, $\mathrm{Var}(\varepsilon_i) = \sigma^2$. Note that this is also the standard linear regression problem. Then the LSE of $(\beta_0, \beta_1)'$ are

$$\hat\beta_1 = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{S_x^2} \quad\text{and}\quad \hat\beta_0 = \bar y - \hat\beta_1 \bar x, \quad\text{where } S_x^2 = \sum (x_i - \bar x)^2.$$

Note that here $(x_1, \ldots, x_n)$ are constants and $\hat\beta_1$ is defined only when $S_x^2 \ne 0$, i.e. there are at least two distinct $x_i$'s, which we will assume to be the case. Note that $E(y_i) = \beta_0 + \beta_1 x_i$ and $E(\bar y) = \beta_0 + \beta_1 \bar x$ and therefore $E(\hat\beta_1) = \beta_1$ and therefore $E(\hat\beta_0) = \beta_0$, i.e. $(\hat\beta_0, \hat\beta_1)'$ is unbiased for $(\beta_0, \beta_1)'$. Now suppose that the errors are normal, i.e. $\{\varepsilon_i\}_1^n$ are i.i.d. $N(0, \sigma^2)$. Then as seen in Example 2.5.4 the joint pdf of $(Y_1, \ldots, Y_n)$ is a three-parameter exponential family with $(\sum x_i y_i, \sum y_i, \sum y_i^2)$ as a minimal complete sufficient statistic for $(\beta_0, \beta_1, \sigma^2)'$. Therefore it follows that $(\hat\beta_0, \hat\beta_1)'$ is M-optimal for $(\beta_0, \beta_1)'$ when the errors are normal. One can show that

$$\hat\sigma^2 = \frac{1}{n - 2}\sum (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2$$

is unbiased for $\sigma^2$ and, being a function of the minimal complete sufficient statistic, it is the MVUE for $\sigma^2$. Therefore from results of Chapter 4, $(\hat\beta_0, \hat\beta_1, \hat\sigma^2)'$ is M-optimal for $(\beta_0, \beta_1, \sigma^2)'$. The variance-covariance matrix of $(\hat\beta_0, \hat\beta_1)$ is given by

$$M_\beta = \frac{\sigma^2}{S_x^2}\begin{pmatrix} \dfrac{\sum x_i^2}{n} & -\bar x \\ -\bar x & 1 \end{pmatrix}.$$
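The least squares quantities above translate directly into code. The following is a small sketch using only the formulas just stated; the function names and the example data are illustrative assumptions.

```python
def least_squares_line(x, y):
    """Return (beta0_hat, beta1_hat, sigma2_hat) for the model y_i = b0 + b1 x_i + e_i."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)          # S_x^2
    if sxx == 0.0:
        raise ValueError("need at least two distinct x values")
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b0, b1, rss / (n - 2)                      # unbiased estimate of sigma^2


def covariance_matrix(x, sigma2):
    """M_beta = (sigma2 / S_x^2) * [[sum x_i^2 / n, -xbar], [-xbar, 1]]."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    scale = sigma2 / sxx
    return [[scale * sum(xi * xi for xi in x) / n, -scale * xbar],
            [-scale * xbar, scale]]


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]                    # illustrative design points
    ys = [1.1, 2.9, 5.2, 6.8, 9.1]                    # illustrative responses
    print(least_squares_line(xs, ys))
    print(covariance_matrix(xs, sigma2=1.0))
```

The entry $\sigma^2/S_x^2$ of $M_\beta$ shrinks only if $S_x^2$ grows with $n$, which depends on how the design points are chosen.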


Observe that under normal errors, $\hat\beta$ would be bivariate normal with mean $\beta$ and variance covariance matrix $M_\beta$ for each $n$, as $\hat\beta_0$ and $\hat\beta_1$ are both linear functions of $(Y_1, \ldots, Y_n)'$ which has an $n$-dimensional normal distribution with mean vector $(\beta_0 + \beta_1 x_1, \ldots, \beta_0 + \beta_1 x_n)'$ and variance-covariance matrix $\sigma^2 I_{n \times n}$. Consider $\hat\beta_1$, the regression coefficient. Then even under normal errors, $\hat\beta_1 \sim N(\beta_1, \sigma^2/S_x^2)$ and would be CAN for $\beta_1$ if and only if $1/S_x^2 \to 0$ as $n \to \infty$. Similarly $\hat\beta_0 \sim N\!\left(\beta_0,\ \dfrac{\sigma^2 \sum x_i^2}{n S_x^2}\right)$ is CAN for $\beta_0$ if and only if $\dfrac{\sum x_i^2}{n S_x^2} \to 0$ as $n \to \infty$. Thus $(\hat\beta_0, \hat\beta_1)$ would be CAN for $(\beta_0, \beta_1)$ only when the constants $(x_1, \ldots, x_n)$ satisfy some conditions. In the usual regression situation $(x_1, \ldots, x_n)$ are fixed values of the so called ‘independent’ or ‘control’ variable and $(y_1, \ldots, y_n)$ are values of the ‘dependent’ or response variable, and usually in large experiments ($n \to \infty$) there are replications at a few distinct values of $x$, where each distinct value $x_i$ occurs $n_i$ times with $\sum_i n_i = n$. We will not go into further details. The above results can be extended to the case when $y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + \varepsilon_i$ where $\{\varepsilon_i\}_1^n$ are i.i.d. $N(0, \sigma^2)$.

As is well known, in this case we have $\hat\beta = (X'X)^{-1}X'Y$ and $M_\beta = \sigma^2 (X'X)^{-1}$, where

$$X' = \begin{pmatrix} 1 & \cdots & 1 \\ x_{11} & \cdots & x_{1n} \\ \vdots & & \vdots \\ x_{k1} & \cdots & x_{kn} \end{pmatrix}.$$

Note that under the normal errors model $\hat\beta \sim N(\beta, M_\beta)$ and $\hat\beta$ would be CAN for $\beta$ if every diagonal element of $(X'X)^{-1} \to 0$.

For non-normal but independent errors with $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$ we will need additional conditions on the existence of fourth order moments of $\varepsilon_i$ in order to claim asymptotic normality of $\hat\beta$. The theory of MVUE or CAN estimators of $\beta$ in a more general regression setup is available in standard texts on regression analysis such as Draper and Smith (1981) among others. However we will not go into further details here.



136 A First Course on Parametric Inference 


One could consider the method of least absolute deviations (LAD) where, in the direct measurement problem, we minimize $\sum |X_i - \theta|$, which leads to the estimator $\hat\theta = X_{([n/2]+1)}$, the median of the sample, which is CAN if the distribution is such that $\xi_{1/2}(F) = \theta$ and the pdf $f(x, \theta)$ does not vanish at $x = \theta$. The situation in the case of the indirect measurement problem is much more complex but we will briefly outline the solution.


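Consider the model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $i = 1, 2, \ldots, n$. Then following Boscovitch, we determine $(\hat\beta_0, \hat\beta_1)$ such that $\sum (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0$, or $\bar y = \hat\beta_0 + \hat\beta_1 \bar x$, or $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$, and subject to this constraint minimize $\sum |y_i - \hat\beta_0 - \hat\beta_1 x_i|$, i.e. $\beta_1$ is estimated by minimizing $S = \sum |y_i - \bar y - \beta_1 (x_i - \bar x)|$.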

We will now describe an algorithm given by Laplace which minimizes $S$ and yields $\hat\beta_1$, the estimate of $\beta_1$ having least absolute deviation. First observe that those $x_i$'s which are exactly equal to $\bar x$ will not contribute to the process of minimization and therefore we consider only those observations for which $x_i - \bar x \ne 0$. Suppose there are $m$ observations such that $x_i \ne \bar x$; then we calculate $b_i = \frac{y_i - \bar y}{x_i - \bar x}$ for such observations and arrange the $b_i$ in decreasing order of magnitude

$$b_{(1)} \ge b_{(2)} \ge \cdots \ge b_{(m)}, \quad (m \le n). \tag{6.5.1}$$

Corresponding to the above arrangement, arrange the $|x_i - \bar x|$ in the same order

$$|x_{r_1} - \bar x|,\ |x_{r_2} - \bar x|,\ \ldots,\ |x_{r_m} - \bar x| \tag{6.5.2}$$

i.e. $b_{(1)}$, the largest value among the $b_i$, corresponds to $i = r_1$, etc.

Let $D = \sum_{i=1}^{m} |x_{r_i} - \bar x|$; then $S$ would be minimum for $\beta_1 = b_{(k)}$, the $k$-th term of the sequence (6.5.1), where $k$ is the smallest integer for which the partial sum $S_k = \sum_{i=1}^{k} |x_{r_i} - \bar x| \ge \frac{1}{2} D$. We take $\hat\beta_1 = b_{(k)} = (y_{r_k} - \bar y)/(x_{r_k} - \bar x)$ where $k$ is obtained as above. If $S_k > \frac{1}{2} D$ then $b_{(k)}$ is uniquely determined, but if $S_k = \frac{1}{2} D$ then $\hat\beta_1$ is not unique and can be taken as any value in $[b_{(k+1)}, b_{(k)}]$. Having determined $\hat\beta_1$, we take $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$. Boscovitch had $n = 5$ observations and he obtained $(\hat\beta_0, \hat\beta_1)$ using a geometric argument. Laplace gave the above algorithm for the general case. Prior to Boscovitch the traditional approach was to calculate the $\binom{n}{2}$ slopes $b_{ij} = \frac{y_i - y_j}{x_i - x_j}$ and take $\hat\beta_1 = \binom{n}{2}^{-1} \sum_{i<j} b_{ij}$. Euler suggested to obtain estimates of $(\beta_0, \beta_1)$ by obtaining

Consistent Asymptotically Normal Estimators 1 37 

two equations in two unknowns in the following manner. The set $\{(x_i, y_i),\ i = 1, 2, \ldots, n\}$ was subdivided by choosing a point $x_0$ and then solving

$$\sum_{x_i \le x_0} (y_i - \beta_0 - \beta_1 x_i) = 0, \qquad \sum_{x_i > x_0} (y_i - \beta_0 - \beta_1 x_i) = 0. \tag{6.5.3}$$


However the estimate $\hat\beta$ would then depend on the choice of $x_0$. On the other hand the Boscovitch method is quite objective. Laplace used it in his 1789 memoir on the Figure of the Earth and later in his famous treatise on Celestial Mechanics published in 1799. However the method of least squares advocated by Gauss and Legendre was equally objective and was much easier to use. Therefore the Boscovitch method was given a back seat by astronomers and other scientists.
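The Laplace algorithm for the Boscovitch fit described earlier in this section amounts to a weighted median of the slopes $b_i$ with weights $|x_i - \bar x|$. The following is a minimal sketch of it in Python; the function names `boscovitch_laplace_fit` and `lad_objective` are our own illustrative choices, and in the tie case $S_k = \frac{1}{2} D$ the sketch simply returns $b_{(k)}$, one of the admissible values.

```python
def boscovitch_laplace_fit(x, y):
    """LAD line constrained to pass through (x_bar, y_bar), via Laplace's rule."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Points with x_i == x_bar do not affect the slope, so they are dropped.
    pairs = [((yi - y_bar) / (xi - x_bar), abs(xi - x_bar))
             for xi, yi in zip(x, y) if xi != x_bar]
    if not pairs:
        raise ValueError("all x values equal x_bar; slope is undefined")
    # Arrange slopes b_i in decreasing order, carrying weights |x_i - x_bar|.
    pairs.sort(key=lambda p: p[0], reverse=True)
    total = sum(weight for _, weight in pairs)   # D
    partial = 0.0
    for slope, weight in pairs:
        partial += weight                        # partial sum S_k
        if partial >= total / 2:                 # smallest k with S_k >= D/2
            beta1 = slope
            break
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


def lad_objective(x, y, beta1):
    """S = sum |y_i - y_bar - beta1 (x_i - x_bar)| for a candidate slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum(abs((yi - y_bar) - beta1 * (xi - x_bar)) for xi, yi in zip(x, y))
```

On data lying exactly on a line the rule recovers that line, and on data with one gross outlier the returned slope can be checked against `lad_objective` evaluated over a grid of candidate slopes.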

However, currently the Boscovitch method based on the $L_1$-norm, i.e. the norm $\sum |d_i|$ for a vector $(d_1, \ldots, d_n)' \in R_n$, is making a strong comeback over the Gauss-Legendre method based on the $L_2$-norm, i.e. $\sum d_i^2$, primarily due to the fact that estimates based on the $L_1$-norm are more stable than those based on the $L_2$-norm. The reader may recall the discussion in Fisher's paper (1920) comparing estimators of $\sigma$ based on $\sum |x_i - \bar x|$ and $\sum (x_i - \bar x)^2$ referred to in Chapter 1. Even there Eddington, in a footnote to Fisher's paper, observed that the estimate based on $\sum |x_i - \bar x|$ is preferred by astronomers to that based on $\sum (x_i - \bar x)^2$, particularly when observations with large deviations are rejected as outliers. Estimation of parameters when outliers are present in the data is a fascinating area and we refer interested readers to an excellent text by Barnett and Lewis (1984).

In the next chapter we consider the method of maximum likelihood, one of the most commonly used methods of estimation to generate CAN estimators.



7 

Method of Maximum Likelihood 


7.1 Introduction 

Suppose that a random sample of size $n$ yields the data $X_1 = x_1, \ldots, X_n = x_n$ under the model specified by the distribution of $X$ belonging to the class $\{f(x, \theta), \theta \in \Omega\}$. When $X$ is discrete we can compute the probability under each $\theta \in \Omega$ given by

$$P[X_1 = x_1, \ldots, X_n = x_n \mid \theta] = \prod_{i=1}^{n} f(x_i, \theta) \tag{7.1.1}$$

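We have seen in Chapter 2, while defining Fisher information, that the variation in $\prod_{i=1}^{n} f(x_i, \theta)$ for fixed $x$ as $\theta$ varies over $\Omega$ provides us information about $\theta$. The RHS of (7.1.1), for fixed $x$ as a function of $\theta \in \Omega$, is called the likelihood function or likelihood of $\theta$. In case $X$ is a continuous r.v., $P[X_1 = x_1, \ldots, X_n = x_n \mid \theta] = 0$ and we define the likelihood of $\theta$ as follows. Consider a neighbourhood of each $x_i$ given by $x_i \pm \frac{\delta x_i}{2}$; then

$$P\left[x_i - \tfrac{\delta x_i}{2} < X_i < x_i + \tfrac{\delta x_i}{2},\ i = 1, 2, \ldots, n \mid \theta\right] = \prod_{i=1}^{n} \left[F\left(x_i + \tfrac{\delta x_i}{2}, \theta\right) - F\left(x_i - \tfrac{\delta x_i}{2}, \theta\right)\right]. \tag{7.1.2}$$

By the mean value theorem

$$F\left(x_i + \tfrac{\delta x_i}{2}, \theta\right) - F\left(x_i - \tfrac{\delta x_i}{2}, \theta\right) = f(x_i, \theta)\, \delta x_i + o(\delta x_i)$$

and the RHS of (7.1.2) reduces to $\left[\prod_{i=1}^{n} f(x_i, \theta)\right] \delta v + o(\delta v)$, where $\delta v$ is the volume element $(\delta x_1 \cdots \delta x_n)$. Then the likelihood of the observed sample $(x_1, \ldots, x_n)$ is defined as $\prod_{i=1}^{n} f(x_i, \theta)$. Thus in either case $\prod_{i=1}^{n} f(x_i, \theta)$, for fixed $x$, is the likelihood of the sample. For discrete $X$ it is the probability of the observed data $x$, and for continuous $X$ it is proportional to an approximation for the probability of observing data in a small neighbourhood of $x$. Even in the discrete case, where the likelihood of $X = x$ is numerically the same as the probability $P[X = x \mid \theta]$, the likelihood function is a point function of $\theta$; there is no meaning to the likelihood of '$\theta_1$ or $\theta_2$', and it does not obey the axioms of probability, which is in fact a set function. The likelihood function of a random sample of size $n$ on $X$ with pdf $f(x, \theta)$, $\theta \in \Omega$, is denoted by

$$L(x, \theta) = \prod_{i=1}^{n} f(x_i, \theta) \tag{7.1.3}$$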


and is a function of $\theta \in \Omega$. Here $\theta$, the indexing parameter, can be real or vector valued but $L(x, \theta)$ is always a real number. Following Fisher (1912) we define $\theta_1$ to be more likely, equally likely or less likely than $\theta_2$ according as $L(x, \theta_1) > L(x, \theta_2)$, $L(x, \theta_1) = L(x, \theta_2)$ or $L(x, \theta_1) < L(x, \theta_2)$. The method of maximum likelihood prescribes estimating $\theta$ by $\hat\theta$, where $\hat\theta$ is the maximal element defined by the above ordering, that is, $L(x, \hat\theta) \ge L(x, \theta)$ for every $\theta \in \Omega$.

The method of maximum likelihood was claimed by Fisher to be more objective than the method of least squares or the method of moments. For the method of moments, as noted in the previous chapter, the choice of the basic random variable $U(X)$ to generate moment equations is indeed arbitrary, although it was a convention to use $U(X) = X$. Similarly, in the method of least squares as applied to the indirect measurement problem, the relationship between $y$ and $(x_1, \ldots, x_k)$ could be better represented by a relationship between a transformed variable $u(y)$ and transformed functions $(g_1(x_1), \ldots, g_k(x_k))$ where the Jacobian $\frac{\partial(g_1, \ldots, g_k)}{\partial(x_1, \ldots, x_k)}$ is non-singular. Further, the use of the $L_2$-norm as prescribed by the method of least squares is also a matter of choice. Note that Boscovitch had already used the $L_1$-norm, and one could use in general the $L_p$-norm given by $\sum |d_i|^p$ for $0 < p < \infty$, or the sup-norm given by $\max |d_i|$, where $(d_1, \ldots, d_k) \in R_k$. On the other hand Fisher pointed out that once the model $\{f(x, \theta), \theta \in \Omega\}$ is specified the method of maximum likelihood is completely objective. It must be pointed out here that the method of maximum likelihood can be applied only when the pdf $f(x, \theta)$ is fully specified except for the indexing parameter $\theta$. On the other hand the method of moments and the least squares method are applicable without such specification. As noted before, in the case of the problem of direct measurement $X_i = \theta + \varepsilon_i$, where $\{\varepsilon_i\}_1^n$ are i.i.d. with $E(\varepsilon_i) = 0$, $\operatorname{Var}(\varepsilon_i) = \sigma^2$, $\bar X$ is the moment estimator and the least squares estimator for distributions such as normal, double exponential, uniform among others.

Fisher (1912) derived maximum likelihood estimators for $(\theta, \sigma^2)$ in the normal model and he was careful to point out that the likelihood function does not satisfy the laws (axioms) of probability. It is important to mention that Fisher was barely 21 years old and had just obtained his Maths Tripos (Bachelor's degree) from Cambridge when he wrote this path-breaking paper presenting the method of maximum likelihood leading to maximum likelihood estimators (MLE) of the indexing parameter $\theta$ in the model specified by $\{f(x, \theta), \theta \in \Omega\}$.

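Before going into the theoretical study of MLE we consider a few examples.

Example 7.1.1 Consider the direct repeated measurement problem where $X_i = \theta + \varepsilon_i$ where $\{\varepsilon_i\}_1^n$ are i.i.d. $N(0, \sigma^2)$; then the likelihood is given by

$$L(x, \theta, \sigma^2) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{n} \exp\left\{-\frac{\sum (x_i - \theta)^2}{2\sigma^2}\right\}, \quad \theta \in R_1,\ \sigma^2 > 0$$

or

$$\log L(x, \theta, \sigma^2) = -\frac{n}{2} \log 2\pi - \frac{n}{2} \log \sigma^2 - \frac{\sum (x_i - \theta)^2}{2\sigma^2} \tag{7.1.4}$$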

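To maximize the likelihood or log likelihood, we consider two stage maximization of (7.1.4). Consider $\sigma^2$ fixed; then the RHS of (7.1.4) is maximized for the minimum value of $\sum (x_i - \theta)^2$, which occurs at $\theta = \bar x$, and therefore

$$\sup_{\theta} \log L(x, \theta, \sigma^2) = -\frac{n}{2} \log 2\pi - \frac{n}{2} \log \sigma^2 - \frac{S^2}{2\sigma^2} \tag{7.1.5}$$

where $S^2 = \sum (x_i - \bar x)^2$. Next, consider maximization w.r.t. $\sigma^2$. Taking the derivative w.r.t. $\sigma^2$ of the RHS of (7.1.5), we have $\hat\sigma^2$ as a solution of $-\frac{n}{2\sigma^2} + \frac{S^2}{2\sigma^4} = 0$, or $\hat\sigma^2 = S^2/n$. Now the second derivative w.r.t. $\sigma^2$ of the RHS of (7.1.5) at $\hat\sigma^2$ is given by $-\frac{n}{2\hat\sigma^4} < 0$. Therefore, the maximum of $\log L$ and hence of $L$ occurs at $\theta = \bar x$, $\sigma^2 = S^2/n$. The MLE of $(\theta, \sigma^2)'$ is then $\hat\theta = \bar x$, $\hat\sigma^2 = S^2/n$. Observe that the MLEs are functions of the minimal sufficient statistics. The MLEs could also be obtained by setting up the likelihood equations given by

$$\frac{\partial \log L}{\partial \theta} = \frac{2 \sum (x_i - \theta)}{2\sigma^2} = 0$$

and

$$\frac{\partial \log L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{\sum (x_i - \theta)^2}{2\sigma^4} = 0$$

yielding the unique solution $\hat\theta = \bar x$, $\hat\sigma^2 = S^2/n$. Note that the matrix of second derivatives is given by

$$\begin{pmatrix} \dfrac{\partial^2 \log L}{\partial \theta^2} & \dfrac{\partial^2 \log L}{\partial \theta\, \partial \sigma^2} \\ \dfrac{\partial^2 \log L}{\partial \sigma^2\, \partial \theta} & \dfrac{\partial^2 \log L}{\partial (\sigma^2)^2} \end{pmatrix}_{(\hat\theta, \hat\sigma^2)} = \begin{pmatrix} -\dfrac{n}{\hat\sigma^2} & 0 \\ 0 & -\dfrac{n}{2\hat\sigma^4} \end{pmatrix}$$

which is negative definite and ensures that the likelihood is maximum at $(\hat\theta, \hat\sigma^2)'$. Observe that, as seen earlier, $(\hat\theta, \hat\sigma^2)'$ is CAN for $(\theta, \sigma^2)'$ with asymptotic variance covariance matrix $\frac{1}{n} J^{-1}(\theta, \sigma^2)$ where $J$ is the Fisher information matrix.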
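As a small numerical check of Example 7.1.1, the sketch below computes $\hat\theta = \bar x$ and $\hat\sigma^2 = S^2/n$ and evaluates the log likelihood (7.1.4). The sample values and the function names are hypothetical, chosen only for illustration.

```python
import math


def normal_log_likelihood(xs, theta, sigma2):
    """log L(x, theta, sigma^2) for i.i.d. N(theta, sigma^2) data, as in (7.1.4)."""
    n = len(xs)
    ss = sum((x - theta) ** 2 for x in xs)
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * n * math.log(sigma2)
            - ss / (2 * sigma2))


def normal_mle(xs):
    """Closed-form MLE: theta_hat = sample mean, sigma2_hat = S^2 / n."""
    n = len(xs)
    theta_hat = sum(xs) / n
    sigma2_hat = sum((x - theta_hat) ** 2 for x in xs) / n
    return theta_hat, sigma2_hat


sample = [4.9, 5.3, 5.1, 4.7, 5.6, 5.0]   # hypothetical measurements
theta_hat, sigma2_hat = normal_mle(sample)
best = normal_log_likelihood(sample, theta_hat, sigma2_hat)
```

Evaluating `normal_log_likelihood` at nearby parameter values gives a smaller value than `best`, in line with the negative definite matrix of second derivatives. Note the divisor $n$, not $n - 1$, in `sigma2_hat`.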


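Example 7.1.2 Let $(X_1, \ldots, X_n)$ be i.i.d. with pdf given by $f(x, \theta, \sigma) = \frac{1}{2\sigma} \exp\left\{-\frac{|x - \theta|}{\sigma}\right\}$, $x \in R_1$, $\theta \in R_1$, $\sigma > 0$. Then the log likelihood of a sample is given by

$$\log L(x, \theta, \sigma) = -n \log 2 - n \log \sigma - \sum |x_i - \theta| / \sigma \tag{7.1.6}$$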

The model corresponds to repeated direct measurements where the errors are i.i.d. double exponential (Laplace) with scale $\sigma$. Again we adopt two stage maximization: first fix $\sigma$; then maximizing $\log L$ for variations in $\theta$ is equivalent to minimizing $\sum |x_i - \theta| = \sum |x_{(i)} - \theta|$ for $\theta \in R_1$. One can show that $\sum_{i=1}^{n} |x_{(i)} - \theta|$ is minimized for

$$\hat\theta = \begin{cases} x_{(m+1)} & \text{if } n = 2m + 1 \\ a\, x_{(m)} + (1 - a)\, x_{(m+1)} & \text{if } n = 2m, \text{ for any } a,\ 0 \le a \le 1. \end{cases}$$

We note that for $n = 2m$, $\hat\theta$ is not uniquely determined and we can take $\hat\theta = x_{(m+1)}$ in this case also. Hence we define $\hat\theta = x_{([n/2]+1)}$ = sample median. Then we maximize $\log L(x, \hat\theta, \sigma) = -n \log 2 - n \log \sigma - \frac{\sum |x_i - \hat\theta|}{\sigma}$. By the usual differentiation technique we can show that $\hat\sigma = \frac{1}{n} \sum_{i=1}^{n} |x_i - \hat\theta|$. Here also the MLE $(\hat\theta, \hat\sigma)$ is a function of the minimal sufficient statistic, which in this case is the order statistic. Here $(\hat\theta, \hat\sigma)'$ can be shown to be CAN by using the asymptotic distribution theory of linear functions of order statistics. For details we refer to David (1981).
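The two stage maximization for the Laplace model can be illustrated by a short sketch. It assumes the convention $\hat\theta = x_{([n/2]+1)}$ adopted in Example 7.1.2; the function name `laplace_mle` is ours.

```python
def laplace_mle(xs):
    """MLE for the double exponential model: sample median and mean absolute deviation."""
    n = len(xs)
    ordered = sorted(xs)
    # 0-based index n // 2 is the order statistic x_([n/2]+1) in 1-based notation.
    theta_hat = ordered[n // 2]
    sigma_hat = sum(abs(x - theta_hat) for x in xs) / n
    return theta_hat, sigma_hat
```

For even $n$ any point between the two middle order statistics maximizes the likelihood in $\theta$; the sketch returns the upper one, as in the text.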

Example 7.1.3 Let $\{X_i\}_1^n$ be i.i.d. Bernoulli with $P[X_i = 1] = \theta$, $P[X_i = 0] = 1 - \theta$. Then the log likelihood is given by $\log L(x, \theta) = \sum x_i \log \frac{\theta}{1 - \theta} + n \log (1 - \theta)$. If $0 < \sum x_i < n$ then using the differentiation technique we can show that $\hat\theta = \bar x$. If $\sum x_i = 0$ then $\log L(x, \theta) = n \log (1 - \theta)$ and is a monotone decreasing function of $\theta$. Its maximum is attained at $\theta = 0$, but this value is on the boundary of the parameter space $\Omega = (0, 1)$. Similarly if $\sum x_i = n$ then $\log L(x, \theta) = n \log \theta$ and its maximum is attained at $\theta = 1$, a value on the boundary of the parameter space. Recall that in earlier chapters, in the discussion of the Cramér-Rao inequality, Fisher information etc., we have assumed $\Omega = (0, 1)$, an open interval. Thus there is some problem in defining the MLE of $\theta$ for certain samples, as the maximum is attained on the boundary of $\Omega$. However observe that $P(\sum X_i = 0 \mid \theta) = (1 - \theta)^n$ is very small for large $n$ as $0 < \theta < 1$, and a similar remark holds for the other case, namely $P(\sum X_i = n \mid \theta) = \theta^n$. By convention we take the MLE of $\theta$ as given by $\hat\theta = \bar x$ in all cases. Note that $\hat\theta$ is CAN for $\theta$ with asymptotic variance $\frac{\theta(1 - \theta)}{n} = \frac{1}{n I(\theta)}$ where $I(\theta)$ is the Fisher information. Another way to tackle the problem of defining the MLE of $\theta$ when $\sum x_i = 0$ or $\sum x_i = n$ is to define the MLE as

$$\hat\theta_1 = \begin{cases} \varepsilon_1 & \text{if } \sum x_i = 0 \\ \bar x & \text{if } 0 < \sum x_i < n \\ 1 - \varepsilon_2 & \text{if } \sum x_i = n \end{cases}$$

where $\varepsilon_1, \varepsilon_2$ are arbitrary positive numbers close to zero. We show that the asymptotic distributions of $\hat\theta_1$ and $\hat\theta = \bar X$ are identical. Now observe that

$$\sqrt{n}(\hat\theta_1 - \theta) = \sqrt{n}(\bar X - \theta) + \sqrt{n}(\hat\theta_1 - \bar X).$$

Now by the CLT, $\sqrt{n}(\bar X - \theta) \xrightarrow{d} N(0, \theta(1 - \theta))$. Further

$$P[\,|\sqrt{n}(\hat\theta_1 - \bar X)| < \varepsilon\,] \ge P[\hat\theta_1 = \bar X] = 1 - \theta^n - (1 - \theta)^n.$$

But for $0 < \theta < 1$, $1 - \theta^n - (1 - \theta)^n \to 1$ as $n \to \infty$. Thus $\sqrt{n}(\hat\theta_1 - \bar X) \xrightarrow{P} 0$ and therefore $\sqrt{n}(\hat\theta_1 - \theta) \xrightarrow{d} N(0, \theta(1 - \theta))$, and there is no difference between the asymptotic distributions of $\hat\theta_1$ and $\hat\theta = \bar X$.
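To see how quickly the boundary samples become negligible, the following sketch evaluates $P(\sum X_i = 0 \text{ or } n \mid \theta) = (1 - \theta)^n + \theta^n$ and implements the modified estimator $\hat\theta_1$. The default values of `eps1` and `eps2` are arbitrary illustrative choices, not values prescribed by the text.

```python
def prob_boundary_sample(theta, n):
    """P(sum X_i = 0 or sum X_i = n) for n i.i.d. Bernoulli(theta) observations."""
    return (1 - theta) ** n + theta ** n


def modified_mle(xs, eps1=1e-3, eps2=1e-3):
    """theta_hat_1: the sample mean, moved off the boundary when sum x_i is 0 or n."""
    n = len(xs)
    t = sum(xs)
    if t == 0:
        return eps1
    if t == n:
        return 1 - eps2
    return t / n
```

For instance, with $\theta = 0.3$ the probability of a boundary sample is already far below $10^{-20}$ at $n = 200$, which is why the choice between $\hat\theta$ and $\hat\theta_1$ does not matter asymptotically.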

We close this section by noting some general properties of the MLE. First suppose that $\{f(x, \theta), \theta \in \Omega\}$ is such that there is a minimal sufficient statistic $T$; then by the Neyman factorizability criterion

$$L(x, \theta) = \prod_{i=1}^{n} f(x_i; \theta) = g(T(x), \theta)\, h(x), \quad \forall\, x \in S,\ \forall\, \theta \in \Omega.$$

Since $\hat\theta$, the MLE, is obtained by maximizing $L(x, \theta)$ for fixed $x$ and for variations in $\theta$, it is equivalent to maximizing $g(T(x), \theta)$ for fixed $x$, or for fixed $T(x) = t$, and for variations in $\theta$. Thus $\hat\theta$ depends on $x$ only through $T(x)$ and the MLE is a function of the minimal sufficient statistic $T$. Next suppose that $\theta \to \phi(\theta)$ is a one-to-one nonsingular transformation. Now, if $\hat\theta$ is the MLE of $\theta$, then $\hat\phi = \phi(\hat\theta)$ would be the MLE of $\phi$, the new indexing parameter. We illustrate this by an example.

Example 7.1.4 Let $(X_1, \ldots, X_n)$ be i.i.d. Poisson with mean $\lambda$; then

$$L(x, \lambda) = \frac{e^{-n\lambda} \lambda^{\sum x_i}}{\prod_{i=1}^{n} x_i!} = \frac{e^{-n\lambda} (n\lambda)^t}{t!} \cdot \frac{t!}{n^t \prod_{i=1}^{n} x_i!}, \quad t = \sum_{i=1}^{n} x_i.$$

Maximizing $L(x, \lambda)$ for $\lambda > 0$ is equivalent to maximizing the pmf of the minimal sufficient statistic $T = \sum X_i$, given by $g(t, \lambda) = \frac{e^{-n\lambda} (n\lambda)^t}{t!}$, $t = 0, 1, 2, \ldots$, for fixed $t$. By the differentiation technique we can show that the MLE is $\hat\lambda = \frac{t}{n} = \bar x$.

Now consider the Poisson distribution with mean $\lambda$ given by the new indexing parameter $\theta = \log \lambda$ or $\lambda = e^{\theta}$. Then as $\hat\lambda = t/n$ we would have $\hat\theta = \log (t/n)$, which can be verified to provide the maximum of

$$L_1(x, \theta) = L(x, e^{\theta}) = \exp\{-n e^{\theta}\}\, \frac{e^{t\theta}}{\prod_{i=1}^{n} x_i!}.$$

Now $\log L_1 = -n e^{\theta} + t\theta + h(x)$, $\frac{\partial \log L_1}{\partial \theta} = -n e^{\theta} + t$, and as $\frac{\partial^2 \log L_1}{\partial \theta^2} = -n e^{\theta} < 0$ the unique solution of the likelihood equation, given by $\hat\theta = \log \frac{t}{n}$, provides the maximum of $L_1(x, \theta)$.
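The invariance of the MLE under the reparametrization $\theta = \log \lambda$ can be checked numerically. In the sketch below the counts are hypothetical data and `log_l1` evaluates $\log L_1$ without the additive term $h(x)$, which does not involve $\theta$.

```python
import math


def poisson_mle(xs):
    """MLE of the Poisson mean: lambda_hat = t / n, the sample mean."""
    return sum(xs) / len(xs)


def log_l1(xs, theta):
    """log L_1(x, theta) = -n e^theta + t theta, omitting h(x)."""
    n = len(xs)
    t = sum(xs)
    return -n * math.exp(theta) + t * theta


counts = [2, 0, 3, 1, 4]           # hypothetical Poisson counts
lam_hat = poisson_mle(counts)      # MLE of lambda
theta_hat = math.log(lam_hat)      # MLE of theta = log(lambda), by invariance
```

The score $-n e^{\theta} + t$ vanishes at `theta_hat`, and `log_l1` is smaller at any other value of $\theta$, so maximizing directly in $\theta$ leads to the same estimate as transforming $\hat\lambda$.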

In the next section we will first consider the problem of determining the MLE for the situation where $\{f(x, \theta), \theta \in \Omega\}$ is an exponential family. Recall that in defining the exponential family we have considered $\Omega$, the natural parameter space, as an open set, thus excluding the boundary points. Observe that the boundary points of $\Omega$, as in the case of Example 7.1.3, correspond to degenerate distributions with all probability mass concentrated at a single point. This is a delicate mathematical point and we will not go into further details in this course.

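Exercise 7.1 (1) Let $(X_1, \ldots, X_n)$ be i.i.d. exponential with location $\theta$ and scale $\sigma$. The pdf is given by $f(x, \theta, \sigma) = \frac{1}{\sigma} \exp\left\{-\frac{x - \theta}{\sigma}\right\}$, $x \ge \theta$, $\theta \in R_1$, $\sigma > 0$. Show that $\hat\theta = X_{(1)}$ and $\hat\sigma = \frac{1}{n} \sum_{i=1}^{n} (X_i - X_{(1)}) = \frac{1}{n} \sum_{i=2}^{n} (X_{(i)} - X_{(1)})$. We have already seen that $(\hat\theta, \hat\sigma)'$ is consistent but not CAN for $(\theta, \sigma)'$.

(2) For $(X_1, \ldots, X_n)$ i.i.d. $U(0, \theta)$, $\theta > 0$, show that $\hat\theta = X_{(n)}$. Again $\hat\theta$ is consistent but not CAN for $\theta$.

(3) Consider Example 7.1.1 and reparametrize by $\phi_1 = -\frac{1}{2\sigma^2}$ and $\phi_2 = \frac{\theta}{\sigma^2}$. Show that $\hat\phi_1 = -\frac{1}{2\hat\sigma^2}$ and $\hat\phi_2 = \frac{\bar x}{\hat\sigma^2}$.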




7.2 MLE in Exponential Family 

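Let $(X_1, \ldots, X_n)$ be a random sample of size $n$ on $X$ where the pdf of $X$ belongs to a one parameter exponential family, so that $\log f(x, \theta) = u(\theta) K(x) + v(\theta) + w(x)$, $x \in S$, $\theta \in \Omega$. Then the log likelihood is $\log L(x, \theta) = u(\theta) \sum K(x_i) + n v(\theta) + \sum w(x_i)$ and the likelihood equation is $\frac{\partial \log L}{\partial \theta} = u'(\theta) \sum K(x_i) + n v'(\theta) = 0$. This is the same as the moment equation based on the sufficient statistic, given by

$$\frac{1}{n} \sum K(x_i) = -\frac{v'(\theta)}{u'(\theta)} = E_\theta(K(X)).$$

Further $\frac{\partial^2 \log L}{\partial \theta^2} = u''(\theta) \sum K(x_i) + n v''(\theta)$ and

$$\left(\frac{\partial^2 \log L}{\partial \theta^2}\right)_{\theta = \hat\theta} = u''(\hat\theta)\, n \left[\frac{-v'(\hat\theta)}{u'(\hat\theta)}\right] + n v''(\hat\theta) = n \left[\frac{u'v'' - v'u''}{u'}\right]_{\hat\theta} = -n I(\hat\theta)$$

where $I(\theta)$ is the Fisher information per unit observation. Therefore $\left(\frac{\partial^2 \log L}{\partial \theta^2}\right)_{\hat\theta} < 0$ and $\hat\theta$, the unique solution of the likelihood equation, is the MLE of $\theta$. Note that, as already seen in Sec. 6.2, $\frac{\partial \log f}{\partial \theta} = u'(\theta) K(x) + v'(\theta)$ and $E\left(\frac{\partial \log f}{\partial \theta}\right) = 0$ implies that $E(K(X)) = -\frac{v'(\theta)}{u'(\theta)}$. Further $\frac{\partial^2 \log f}{\partial \theta^2} = u''(\theta) K(x) + v''(\theta)$ and

$$I(\theta) = E\left[-\frac{\partial^2 \log f}{\partial \theta^2}\right] = -\left[\frac{u'v'' - v'u''}{u'}\right]_{\theta}.$$

We note that $\hat\theta$, the MLE, is here the same as the moment estimator based on the sufficient statistic and by Theorem 6.2.2, $\hat\theta$ is CAN for $\theta$ with asymptotic variance $\frac{1}{n I(\theta)}$.

Theorem 7.2.1 For a distribution belonging to a one parameter exponential family the MLE $\hat\theta$ is CAN for $\theta$ with asymptotic variance $\frac{1}{n I(\theta)}$.

The above result can easily be extended to an m-parameter exponential family to show that the MLE $\hat\theta$ is CAN for $\theta$ with asymptotic variance covariance matrix $\frac{1}{n} J^{-1}(\theta)$. Further the MLE is the same as the moment estimator based on the sufficient statistic $T = (T_1, \ldots, T_m)$ where $T_r = \frac{1}{n} \sum_{j=1}^{n} K_r(x_j)$.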
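As an illustration of the one parameter formulas, take the exponential distribution with mean $\theta$, for which $\log f(x, \theta) = -x/\theta - \log \theta$, so that $u(\theta) = -1/\theta$, $K(x) = x$ and $v(\theta) = -\log \theta$. The sketch below evaluates $E_\theta(K(X)) = -v'/u'$ and $I(\theta) = -(u'v'' - v'u'')/u'$ from hand-computed derivatives; the helper names and the value $\theta = 2.5$ are our own choices.

```python
def fisher_information(u1, u2, v1, v2):
    """I(theta) = -(u' v'' - v' u'') / u', with derivatives evaluated at theta."""
    return -(u1 * v2 - v1 * u2) / u1


def moment_target(u1, v1):
    """E_theta K(X) = -v'(theta) / u'(theta), the RHS of the moment equation."""
    return -v1 / u1


theta = 2.5                               # illustrative parameter value
u1, u2 = 1 / theta**2, -2 / theta**3      # u' and u'' for u = -1/theta
v1, v2 = -1 / theta, 1 / theta**2         # v' and v'' for v = -log(theta)
info = fisher_information(u1, u2, v1, v2)
mean_k = moment_target(u1, v1)
```

Here `mean_k` equals $\theta$, so the likelihood equation reduces to $\bar x = \theta$, and `info` equals $1/\theta^2$, giving asymptotic variance $\theta^2/n$ for $\hat\theta = \bar x$.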


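Let $(X_1, \ldots, X_n)$ be a random sample of size $n$ from an m-parameter exponential family with pdf such that

$$\log f(x, \theta) = \sum_{r=1}^{m} u_r(\theta) K_r(x) + v(\theta) + w(x),$$

where $\theta = (\theta_1, \ldots, \theta_m)'$. Consider the parametric transformation $\phi = u(\theta)$, i.e. $\phi_r = u_r(\theta)$, so that $\phi$ is the canonical parameter. Then the log likelihood is

$$\log L(x, \phi) = n \sum_{r=1}^{m} \phi_r T_r(x) + n v_1(\phi) + \sum w(x_i)$$

where $v_1(\phi)$ denotes $v(\theta)$ expressed in terms of $\phi$. The likelihood equations are

$$\frac{\partial \log L}{\partial \phi_r} = n \left[ T_r(x) + \frac{\partial v_1}{\partial \phi_r} \right] = 0, \quad r = 1, 2, \ldots, m$$

and

$$\frac{\partial^2 \log L}{\partial \phi_r\, \partial \phi_s} = n \frac{\partial^2 v_1}{\partial \phi_r\, \partial \phi_s} = -n J_{rs}(\phi).$$

The MLE $\hat\phi$ is now the solution of the moment equations given by $T_r(x) = -\frac{\partial v_1}{\partial \phi_r} = E_\phi(T_r(X))$, $r = 1, 2, \ldots, m$. As $J = J(\phi)$, the Fisher information matrix about $\phi$, is p.d., it follows that $\hat\phi$ provides the maximum of the likelihood of $\phi$. Again retransforming to $\theta_r = h_r(\phi_1, \ldots, \phi_m)$, $r = 1, 2, \ldots, m$, where $H = u^{-1}$ is the inverse transformation, and from the results of Section 6.4, it follows that $\hat\theta$ is CAN for $\theta$ with asymptotic variance covariance matrix $\frac{1}{n} J^{-1}(\theta)$. From the invariance property of the MLE under 1:1 transformation discussed in Sec. 7.1, we note that $\hat\theta$ is indeed the MLE of $\theta$.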

In the above discussion it has been tacitly assumed that $\hat\theta$, the MLE, the unique solution of the likelihood equations, belongs to the parameter space $\Omega$. Recall that as per the regularity conditions for the exponential family, the natural parameter space $\Omega$ is an open set. Now $\log L(x, \theta)$ for fixed $x$ as a function of $\theta$ varying over $\Omega$ is continuous, and as $\Omega$ is an open set the maximum of $L(x, \theta)$ or $\log L(x, \theta)$ may not be attained at an interior point of $\Omega$ but may be attained only on the boundary of $\Omega$. We will allow this to occur and regard $\hat\theta$, which may be on the boundary of $\Omega$, as the MLE of $\theta$, although strictly speaking the MLE of $\theta$ is not defined in this situation. We illustrate this point by an example.

Example 7.2.1 Let $(X_1, \ldots, X_n)$ be i.i.d. $b(1, \theta)$ where $\Omega = (0, 1)$. Then $L(x, \theta) = \prod_{i=1}^{n} \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^t (1 - \theta)^{n - t}$, where $t = \sum x_i$. The likelihood equation is $\frac{\partial \log L}{\partial \theta} = \frac{t - n\theta}{\theta(1 - \theta)} = 0$ and gives the solution $\hat\theta = \frac{t}{n} = \bar x$. However for $t = 0$ or $t = n$ the likelihood of the sample is given by

$$L(x, \theta) = (1 - \theta)^n \ \text{ if } t = 0, \qquad L(x, \theta) = \theta^n \ \text{ if } t = n$$

and $\frac{\partial \log L}{\partial \theta} = 0$ has no solution in $\Omega$. We however take

$$\hat\theta = \begin{cases} 0 & \text{if } \sum x_i = 0 \\ 1 & \text{if } \sum x_i = n \end{cases}$$

which are values on the boundary of $\Omega$.

We observe that the boundary points $\theta = 0$ and $\theta = 1$ correspond to a degenerate random variable.

Note that the pmf of $X$ belongs to a one parameter exponential family with $K(x) = x$, so that the method of moments also has the same problem, since in this case the moment equation and the likelihood equation are identical. We refer to the discussion in Example 6.4.2 and Example 7.1.3 where a similar problem had arisen for the method of moments.

The distinction between whether $\Omega$ is an open or a closed set is mathematically important but in practice may not be very crucial. However the asymptotic distribution of the estimator at a boundary point may be non-normal. We illustrate this by way of an example.


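Example 7.2.2 Let $\{X_i\}_1^n$ be i.i.d. $N(\mu, 1)$ where a priori it is known that $\mu \ge 0$, so that $\Omega = [0, \infty)$. The log likelihood of the sample is

$$\log L(x, \mu) = c - \frac{\sum (x_i - \mu)^2}{2},$$

$\frac{\partial \log L}{\partial \mu} = n(\bar x - \mu)$ and $\frac{\partial^2 \log L}{\partial \mu^2} = -n$ at any $\mu > 0$, and at $\mu = 0$ we consider differentiation w.r.t. $\mu$ from the right. For observed $\bar x \in [0, \infty)$, $\hat\mu = \bar x$, but if $\bar x < 0$ then $\frac{\partial \log L}{\partial \mu} < 0$ for any $\mu \in [0, \infty)$ and $\log L$ is decreasing. Therefore the maximum of $\log L$ occurs at $\mu = 0$. Hence, we define the MLE to be

$$\hat\mu_1 = \begin{cases} \bar x & \text{if } \bar x \ge 0 \\ 0 & \text{if } \bar x < 0. \end{cases}$$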



Now consider

$$\sqrt{n}(\hat\mu_1 - \mu) = \sqrt{n}(\bar x - \mu) + \sqrt{n}(\hat\mu_1 - \bar x).$$

Now $\sqrt{n}(\bar x - \mu) \xrightarrow{d} N(0, 1)$ for any $\mu \in [0, \infty)$ and

$$P[\,|\sqrt{n}(\hat\mu_1 - \bar x)| < \varepsilon\,] \ge P[\hat\mu_1 = \bar x] = P(\bar x \ge 0) = 1 - \Phi(-\sqrt{n}\,\mu).$$

For any $\mu > 0$, $1 - \Phi(-\sqrt{n}\,\mu) \to 1$ as $n \to \infty$ and therefore $\hat\mu_1 \sim AN\left(\mu, \frac{1}{n}\right)$ for any $\mu > 0$. But at $\mu = 0$, $P[\bar x \ge 0] = 1/2$ and the d.f. of $\sqrt{n}(\hat\mu_1 - 0)$ at $\mu = 0$ is given by

$$G_n(u) = \begin{cases} 0 & \text{if } u < 0 \\ \Phi(u) & \text{if } u \ge 0 \end{cases}$$

which has a jump of $1/2$ at $u = 0$ and is not normal. Thus, Theorem 7.2.1 holds for all values of $\mu \in \Omega = [0, \infty)$ except $\mu = 0$.

We also note that MSE$(\hat\mu_1) \le$ MSE$(\bar X)$ for any $\mu \in [0, \infty)$. This follows from the fact that, with $g(x, \mu)$ denoting the pdf of $\bar X$,

$$\text{MSE}(\hat\mu_1) = \int_0^{\infty} (x - \mu)^2 g(x, \mu)\, dx + \int_{-\infty}^{0} \mu^2 g(x, \mu)\, dx$$

$$\text{MSE}(\bar X) = \int_0^{\infty} (x - \mu)^2 g(x, \mu)\, dx + \int_{-\infty}^{0} (x - \mu)^2 g(x, \mu)\, dx$$

Therefore

$$\text{MSE}(\bar X) - \text{MSE}(\hat\mu_1) = \int_{-\infty}^{0} x(x - 2\mu)\, g(x, \mu)\, dx \tag{7.2.1}$$

Now as $x < 0$ over the range of integration and as $\mu \ge 0$, we have $(x - 2\mu) < 0$ over the range of integration and therefore the RHS of (7.2.1) is non-negative and

$$\text{MSE}(\bar X) \ge \text{MSE}(\hat\mu_1) \quad \text{for any } \mu \in [0, \infty).$$

On the other hand if we delete the point $\mu = 0$ and consider the parameter space $\Omega = (0, \infty)$, Theorem 7.2.1 holds and we can take the MLE of $\mu$ to be $\bar X$, as the asymptotic distributions of $\bar X$ and $\hat\mu_1$ are the same for $\mu \in (0, \infty)$. The inclusion of $\mu = 0$ in $\Omega$ or otherwise is a problem of model specification and we will not go into further details here.
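The comparison of mean squared errors in Example 7.2.2 can be illustrated by a small Monte Carlo sketch. The sample size, number of replications and seed below are arbitrary illustrative choices, and the function names are ours.

```python
import random


def restricted_mle(xs):
    """mu_hat_1 = max(x_bar, 0) for the N(mu, 1) model with mu >= 0."""
    return max(sum(xs) / len(xs), 0.0)


def simulate_mse(mu, n, reps, seed=0):
    """Monte Carlo estimates of MSE(X_bar) and MSE(mu_hat_1) at a given mu >= 0."""
    rng = random.Random(seed)
    se_mean = 0.0
    se_restricted = 0.0
    for _ in range(reps):
        xs = [rng.gauss(mu, 1.0) for _ in range(n)]
        x_bar = sum(xs) / n
        se_mean += (x_bar - mu) ** 2
        se_restricted += (restricted_mle(xs) - mu) ** 2
    return se_mean / reps, se_restricted / reps
```

Since $(\hat\mu_1 - \mu)^2 \le (\bar x - \mu)^2$ holds sample by sample whenever $\mu \ge 0$, the second estimate never exceeds the first. The gain is largest near $\mu = 0$ and disappears for large $\sqrt{n}\,\mu$, when $\bar x < 0$ becomes very unlikely.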


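Exercise 7.2.1 (1) Let $\{X_i\}_1^n$ be i.i.d. $b(1, \theta)$ where a priori it is known that $\theta \in \left[\frac{1}{4}, \frac{3}{4}\right]$. Show that technically the MLE of $\theta$ is

$$\hat\theta_1 = \begin{cases} \frac{1}{4} & \text{if } \bar x < \frac{1}{4} \\ \bar x & \text{if } \bar x \in \left[\frac{1}{4}, \frac{3}{4}\right] \\ \frac{3}{4} & \text{if } \bar x > \frac{3}{4}. \end{cases}$$

Show that $\hat\theta_1 \sim AN\left(\theta, \frac{\theta(1 - \theta)}{n}\right)$ for any $\theta \in \left(\frac{1}{4}, \frac{3}{4}\right)$ but at $\theta = \frac{1}{4}$ and $\theta = \frac{3}{4}$ the MLE $\hat\theta_1$ is not asymptotically normal. Show that MSE$(\hat\theta_1) \le$ MSE$(\bar X)$ for any $\theta \in \left[\frac{1}{4}, \frac{3}{4}\right]$.

(2) Let $(X_1, \ldots, X_n)$ be i.i.d. Poisson with mean $\lambda$. Show that $\hat\lambda = \bar x$ is the MLE when $\bar x > 0$. Discuss the case when $\bar x = 0$ and show that

$$\hat\lambda_1 = \begin{cases} \bar x & \text{if } \bar x > 0 \\ \varepsilon & \text{if } \bar x = 0 \end{cases}$$

is such that $\sqrt{n}(\hat\lambda_1 - \lambda) \xrightarrow{d} N(0, \lambda)$.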

(3) For the sample of size n from bivariate pmf/fct, y, A, p) = A' (*j p > ( , _ pr 

on'crit^’casesri^O. x'^O Id 

7.3 Cramer Family 

In the previous section we have seen that the method of maximum likelihood, when applied to a one parameter exponential family, leads to a CAN estimator of $\theta$ with asymptotic variance equal to the CRLB. Cramér extended this result to a larger family of distributions, or models under consideration, by stating regularity conditions on the family in a suitable way. The family of distributions satisfying the Cramér regularity conditions will be called a Cramér family and we will illustrate it by examples. We will first consider the case where $\theta$ is a real parameter.

We thus have a random sample $(X_1, \ldots, X_n)$, i.i.d. with pdf belonging to the family $\{f(x, \theta), \theta \in \Omega \subset R_1\}$. Cramér required that the regularity conditions are satisfied in an open neighbourhood $N_\rho(\theta_0) \subset \Omega$, where $\theta_0$ is the "true value" of the parameter, i.e. the random sample is drawn from the distribution indexed by $\theta_0$. Since $N_\rho(\theta_0)$ is properly contained in $\Omega$ it follows that $\theta_0$ is an interior point of $\Omega$ and we are not sampling from a distribution where the indexing parameter $\theta_0$ is on the boundary of $\Omega$. We thus can as well assume that $\Omega$ is an open subset of $R_1$. The Cramér conditions were to ensure that the Fisher information $I(\theta)$ is strictly positive and finite at each $\theta \in \Omega$, together with an additional condition on the third order derivative of $\log f(x, \theta)$ w.r.t. $\theta$.


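Cramér regularity conditions on the family $\{f(x, \theta), \theta \in \Omega \subset R_1\}$ are then as follows.

C-1 : Support $S_\theta = \{x \mid f(x, \theta) > 0\}$ does not depend on $\theta$ so that $S_\theta = S$.

C-2 : The parameter space $\Omega$ is an open interval of $R_1$.

C-3 : $\log f(x, \theta)$ is such that $\frac{\partial \log f}{\partial \theta}$, $\frac{\partial^2 \log f}{\partial \theta^2}$ and $\frac{\partial^3 \log f}{\partial \theta^3}$ exist for almost all values of $x \in S$ for every $\theta \in N_\rho(\theta_0)$ for some $\rho > 0$.

C-4 : $\int_S f(x, \theta)\, dx = 1$ can be differentiated twice under the integral sign w.r.t. $\theta$ so that $E\left(\frac{\partial \log f}{\partial \theta}\right) = 0$ and $E\left(\frac{\partial \log f}{\partial \theta}\right)^2 = E\left[-\frac{\partial^2 \log f}{\partial \theta^2}\right] = I(\theta)$, the Fisher information, is positive.

C-5 : $\left|\frac{\partial^3 \log f}{\partial \theta^3}\right| \le M(x)$ where $E_\theta M(X) < \infty$. Here $M(x)$ may depend on $\theta_0$ and $\rho$.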

It is easy to verify that the above regularity conditions hold for any one parameter exponential family when the parameter space $\Omega$ is restricted to an open interval. It may be pointed out here that the family $\{N(\mu, 1), \mu \ge 0\}$ considered in Example 7.2.2 will not be a Cramér family, but its sub-family obtained by deleting the point $\mu = 0$ would be a Cramér family.

As an example of a family which is not an exponential family but is a Cramér family consider the Cauchy pdf given by

$$f(x, \mu) = \frac{1}{\pi} \cdot \frac{1}{1 + (x - \mu)^2}, \quad x \in R_1,\ \mu \in R_1.$$

Here conditions C-1 and C-2 hold and

$$\log f(x, \mu) = -\log \pi - \log [1 + (x - \mu)^2].$$

Now for fixed $x$, $1 + (x - \mu)^2$, being a polynomial of degree two, is an analytic function of $\mu$ and therefore $\log [1 + (x - \mu)^2]$ is also an analytic function of $\mu$ and C-3 is satisfied. The checking of validity of differentiation under the integral sign is a little complicated. But we observe that $\frac{\partial f}{\partial \mu} = \frac{2}{\pi} \cdot \frac{(x - \mu)}{[1 + (x - \mu)^2]^2}$ is itself a continuous function of $\mu$ and is an integrable function over any finite interval $(a, b)$, and this integral is uniformly convergent as $a \to -\infty$ and $b \to +\infty$ since the integrand behaves like $\frac{1}{x^3}$ near $\pm\infty$. Therefore $\int f(x, \mu)\, dx = 1$ can be differentiated under the integral sign w.r.t. $\mu$ once to get $\int \frac{\partial f}{\partial \mu}\, dx = 0$. Consider second order differentiation. Now


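$$\frac{\partial^2 f}{\partial \mu^2} = -\frac{2}{\pi} \cdot \frac{1}{[1 + (x - \mu)^2]^2} + \frac{1}{\pi} \cdot \frac{8 (x - \mu)^2}{[1 + (x - \mu)^2]^3}$$

which is continuous and integrable over any interval $(a, b)$ and this integral is uniformly convergent as $a \to -\infty$ and $b \to +\infty$ since the two terms on the RHS behave like $\frac{1}{x^4}$ near $\pm\infty$.

It now easily follows that $E\left(\frac{\partial \log f}{\partial \mu}\right) = 0$ and

$$I(\mu) = E\left(\frac{\partial \log f}{\partial \mu}\right)^2 = \int_{R_1} \frac{4 (x - \mu)^2}{[1 + (x - \mu)^2]^2}\, f(x, \mu)\, dx = \frac{4}{\pi} \int_{R_1} \frac{(x - \mu)^2}{[1 + (x - \mu)^2]^3}\, dx.$$

Putting $x - \mu = t$ and then $t^2 = w$ we have

$$I(\mu) = \frac{8}{\pi} \int_0^{\infty} \frac{t^2}{(1 + t^2)^3}\, dt = \frac{4}{\pi} \int_0^{\infty} \frac{w^{1/2}}{(1 + w)^3}\, dw = \frac{4}{\pi}\, B(3/2, 3/2) = \frac{1}{2}.$$

Now by straightforward calculations we can show that

$$\frac{\partial^3 \log f}{\partial \mu^3} = \frac{-12 (x - \mu)}{[1 + (x - \mu)^2]^2} + \frac{16 (x - \mu)^3}{[1 + (x - \mu)^2]^3}.$$

Again $\int_{R_1} \left|\frac{\partial^3 \log f}{\partial \mu^3}\right| f(x, \mu)\, dx$ exists as the integrand behaves like $\frac{1}{|x|^5}$ near $\pm\infty$, and in fact here $E\left(\frac{\partial^3 \log f}{\partial \mu^3}\right) = 0$ for any $\mu \in N_\rho(\mu_0)$.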
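The value $I(\mu) = 1/2$ for the Cauchy location family can be confirmed numerically. The sketch below is only a check under our own choice of quadrature: with the substitution $x - \mu = \tan \varphi$, the integral $\frac{4}{\pi} \int \frac{u^2}{(1 + u^2)^3}\, du$ becomes $\frac{4}{\pi} \int_{-\pi/2}^{\pi/2} \sin^2 \varphi \cos^2 \varphi\, d\varphi$, which is evaluated by the midpoint rule.

```python
import math


def cauchy_fisher_information(steps=10000):
    """Midpoint-rule value of (4/pi) * integral of sin^2(phi) cos^2(phi) over (-pi/2, pi/2)."""
    h = math.pi / steps
    total = 0.0
    for k in range(steps):
        phi = -math.pi / 2 + (k + 0.5) * h   # midpoint of the k-th subinterval
        total += (math.sin(phi) * math.cos(phi)) ** 2
    return (4 / math.pi) * total * h


info = cauchy_fisher_information()
```

The result agrees with $\frac{4}{\pi} B(3/2, 3/2) = \frac{1}{2}$, so the CRLB for $\mu$ in this model is $2/n$.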

^ * 

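To emphasize the fact that the conditions C-3 and C-5 need hold only in $N_\rho(\theta_0)$, consider $(X_1, \ldots, X_n)$ as i.i.d. exponential with mean $\theta$ so that $\log f(x, \theta) = -\log \theta - \frac{x}{\theta}$. Then

$$\frac{\partial \log f}{\partial \theta} = -\frac{1}{\theta} + \frac{x}{\theta^2}, \qquad \frac{\partial^2 \log f}{\partial \theta^2} = \frac{1}{\theta^2} - \frac{2x}{\theta^3}, \qquad \frac{\partial^3 \log f}{\partial \theta^3} = -\frac{2}{\theta^3} + \frac{6x}{\theta^4}.$$

Now for $\theta \in (\theta_0 - \rho, \theta_0 + \rho)$ where $\theta_0 - \rho > 0$,

$$\left|\frac{\partial^3 \log f}{\partial \theta^3}\right| \le \frac{2}{(\theta_0 - \rho)^3} + \frac{6x}{(\theta_0 - \rho)^4} = M(x)$$

and $E_\theta(M(X)) = \frac{2}{(\theta_0 - \rho)^3} + \frac{6\theta}{(\theta_0 - \rho)^4}$ exists for any $\theta$.

Note that the function $\left|\frac{\partial^3 \log f}{\partial \theta^3}\right|$ for $\theta \in \Omega = (0, \infty)$ is not bounded, as $\frac{2}{\theta^3}$ can be made arbitrarily large by selecting $\theta$ sufficiently small.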
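The local nature of the bound in C-5 for the exponential model can be checked directly. In the sketch below the function names are ours, and the values of $\theta_0$, $\rho$ and $x$ passed in are meant to be chosen by the reader subject to $\theta_0 - \rho > 0$.

```python
def third_derivative(x, theta):
    """d^3 log f / d theta^3 = -2/theta^3 + 6x/theta^4 for the exponential pdf with mean theta."""
    return -2 / theta**3 + 6 * x / theta**4


def dominating_m(x, theta0, rho):
    """M(x) = 2/(theta0 - rho)^3 + 6x/(theta0 - rho)^4, valid on (theta0 - rho, theta0 + rho)."""
    lower = theta0 - rho
    if lower <= 0:
        raise ValueError("need theta0 - rho > 0")
    return 2 / lower**3 + 6 * x / lower**4
```

Inside the neighbourhood the absolute third derivative stays below `dominating_m`, whereas for $\theta$ close to zero, outside the neighbourhood, it exceeds any such fixed bound.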

As an example of a family of distributions which is not a Cramér family, we consider the double exponential or Laplace distribution with

$$f(x, \theta) = \frac{1}{2} \exp\{-|x - \theta|\}, \quad x \in R_1,\ \theta \in R_1.$$

Here although C-1 and C-2 hold, $\log f(x, \theta) = -\log 2 - |x - \theta|$, and as $|x - \theta|$ is not a differentiable function of $\theta$ for fixed $x$ at $\theta = x$ we have a problem. Note that as $\theta$ varies over $N_\rho(\theta_0)$ the set of exceptional values of $x \in R_1$ where $\frac{\partial \log f}{\partial \theta}$ does not exist would be $(\theta_0 - \rho, \theta_0 + \rho)$, which has positive probability under each $\theta \in R_1$. Again any class of distributions such as $\{U(0, \theta), \theta > 0\}$ or the exponential distribution with location $\theta$ with pdf $f(x, \theta) = \exp\{-(x - \theta)\}$, $x \ge \theta$, $\theta \in R_1$, will not be a Cramér family since the range of the r.v. or support of the pdf depends on the parameter $\theta$ and C-1 is violated.

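One can in a similar way define an m-parameter Cramér family. Let $\{f(x, \theta), \theta \in \Omega \subset R_m\}$ be a family of pdfs indexed by $\theta = (\theta_1, \ldots, \theta_m)'$. This will be a Cramér family if the following regularity conditions hold in $N_\rho(\theta_0)$.

C'-1 Support $S_\theta = \{x \mid f(x, \theta) > 0\}$ does not depend on $\theta$, or $S_\theta = S$.

C'-2 Partial derivatives of $\log f(x, \theta)$ w.r.t. components of $\theta$ exist up to third order for almost all values of $x \in S$ for every $\theta \in N_\rho(\theta_0)$ where $\theta_0$ is the true value of the parameter.

C'-3 $\int_S f(x, \theta)\, dx = 1$ can be differentiated twice under the integral sign w.r.t. components of $\theta$, so that $E\left(\frac{\partial \log f}{\partial \theta_r}\right) = 0$, $r = 1, 2, \ldots, m$, and

$$J_{rs} = E\left(\frac{\partial \log f}{\partial \theta_r} \cdot \frac{\partial \log f}{\partial \theta_s}\right) = E\left(-\frac{\partial^2 \log f}{\partial \theta_r\, \partial \theta_s}\right)$$

is such that the Fisher information matrix $J = ((J_{rs}))$ is positive definite.

C'-4 For any $(\theta_r, \theta_s, \theta_t)$, $r = 1, 2, \ldots, m$, $s = 1, 2, \ldots, m$, $t = 1, 2, \ldots, m$, $\left|\frac{\partial^3 \log f}{\partial \theta_r\, \partial \theta_s\, \partial \theta_t}\right| \le M_{rst}(x)$ for any $\theta \in N_\rho(\theta_0)$ where $E_\theta(M_{rst}(X)) < \infty$.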





One can verify that an m-parameter exponential family is also a Cramér family, and as an illustration of a non-exponential two parameter family we consider the Cauchy distribution with pdf

$$f(x, \mu, \sigma) = \frac{1}{\pi \sigma} \cdot \frac{1}{1 + \left(\frac{x - \mu}{\sigma}\right)^2}, \quad x \in R_1,\ \mu \in R_1,\ \sigma > 0.$$

As an example of a family of distributions which is not a Cramér family one can consider the two parameter Laplace or double exponential distribution with pdf

$$f(x, \mu, \sigma) = \frac{1}{2\sigma} \exp\left\{-\frac{|x - \mu|}{\sigma}\right\}, \quad x \in R_1,\ \mu \in R_1,\ \sigma > 0.$$

Here it is interesting to observe that for each $\mu$ fixed (known) and $\sigma > 0$ we have a one-parameter exponential family and therefore also a Cramér family, but for each $\sigma$ fixed (known) and $\mu \in R_1$ we have neither an exponential nor a Cramér family. For the Laplace distribution C'-1 and C'-2 are satisfied, but not the other conditions. When the support depends on $\theta$, as in the case of $U(\mu - \sigma, \mu + \sigma)$, C'-1 is violated and we do not have a Cramér family. Cramér (1946), under the regularity conditions C-1 through C-5, proved that with probability approaching one as $n \to \infty$, the likelihood equation admits a solution $\hat\theta$ which is CAN with $AV(\hat\theta) = \frac{1}{n I(\theta)}$. However, the questions of uniqueness of the solution and whether $\hat\theta$ provides the maximum of the likelihood were not addressed. Huzurbazar (1948) proved that for large samples with very high probability there is at least a relative maximum at $\hat\theta$ and the likelihood equation admits a unique consistent solution. Thus the Cramér and Huzurbazar results showed that for sufficiently large samples, with probability close to one, the likelihood equation has a unique consistent solution which is CAN with asymptotic variance equal to the CRLB $= \frac{1}{n I(\theta)}$. In the next section we will state and prove these results as a theorem and label it the Cramér-Huzurbazar Theorem.

7.4 Cramer-Huzurbazar Theorem 

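Let $\{f(x, \theta), \theta \in \Omega \subset R_1\}$ be a one parameter Cramér family satisfying regularity conditions C-1 through C-5. Then we have the following results.

R-1: With probability approaching one as $n \to \infty$, the likelihood equation admits a consistent solution $\hat\theta$. [Cramér (1946)].

R-2: $\sqrt{n}(\hat\theta - \theta) \xrightarrow{d} N\left(0, \frac{1}{I(\theta)}\right)$ or $\hat\theta \sim AN\left(\theta, \frac{1}{n I(\theta)}\right)$. [Cramér (1946)].

R-3: With probability approaching one as $n \to \infty$, at $\hat\theta$ there is a relative maximum of the likelihood. [Huzurbazar (1948)].

R-4: With probability approaching one as $n \to \infty$, the consistent solution $\hat\theta$ of the likelihood equation is unique, in the sense that the probability that the likelihood equation has two distinct consistent solutions tends to zero as $n \to \infty$. [Huzurbazar (1948)].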

Kf!- * " ““ i ‘* “* **“» * * 1 «i, „ 

j( ^ w -S«,^.W.i».o, s<p , u „fe«2 ibtfc 

. /ife at 0 and 0o- Then E S ..(Y) = f ./(•*> 0) f , . 
ratio of P dfs at ° 9,A ' J, Jur^i fa %)dx = 1. Now 

lnfl V Then as cp(m) = - log U is a CtrWl.._ 


tiO P 1 


fau u * j.v j c/ 0 j “ 1 • fww 

consist - log r. Then as <?(») = - log „ is a strictly convex function with 
/(«) C >0we haV£ by ,enSen ' S ine£ l ualit y Br log f) > - log E(Y) = 0 


and therefore 


£(- log k) = J - log J^TJJ fix, O o )dx>0 

Ee 0 dog Y) = E eo [log = _ ne t e o) < o 


(7.4.1) 


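The quantity $I(\theta, \theta_0)$ defined in (7.4.1) is known as the Kullback-Leibler information
per unit of observation. We also note that by Taylor expansion of $\log f(x, \theta)$
around $\theta_0$

$$\log f(x, \theta) = \log f(x, \theta_0) + (\theta - \theta_0)\left(\frac{\partial \log f}{\partial\theta}\right)_{\theta_0} + \frac{(\theta - \theta_0)^2}{2}\left(\frac{\partial^2 \log f}{\partial\theta^2}\right)_{\theta_0} + o(\theta - \theta_0)^2. \qquad (7.4.2)$$

Taking expectations under $\theta_0$,

$$E_{\theta_0}\left[-\log \frac{f(x, \theta)}{f(x, \theta_0)}\right] = \frac{(\theta - \theta_0)^2}{2}\, I(\theta_0) + o(\theta - \theta_0)^2,$$

that is, $I(\theta, \theta_0) = \frac{(\theta - \theta_0)^2}{2}\, I(\theta_0) + o(\theta - \theta_0)^2$.

Now let $Y_i = \log \frac{f(x_i, \theta)}{f(x_i, \theta_0)}$; then $\{Y_i\}_1^n$ are i.i.d. r.v.s with $E_{\theta_0}(Y_i) = -I(\theta, \theta_0) < 0$.
Therefore by WLLN, under $\theta_0$,

$$\frac{1}{n}\sum_{i=1}^{n} Y_i = \frac{1}{n} \log \frac{L(x, \theta)}{L(x, \theta_0)} \xrightarrow{p} -I(\theta, \theta_0) < 0. \qquad (7.4.3)$$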
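As a numerical illustration of the quadratic approximation to $I(\theta, \theta_0)$, the following minimal sketch uses the Poisson family, which is an assumption made only for this example. For Poisson means the Kullback-Leibler information has the closed form $\theta - \theta_0 + \theta_0 \log(\theta_0/\theta)$ and $I(\theta_0) = 1/\theta_0$; the function names are hypothetical.

```python
import math

def kl_poisson(theta, theta0):
    """I(theta, theta0) = E_{theta0}[log f(X, theta0) / f(X, theta)] for Poisson means."""
    return theta - theta0 + theta0 * math.log(theta0 / theta)

def quadratic_term(theta, theta0):
    """Leading term (theta - theta0)^2 * I(theta0) / 2, with I(theta0) = 1 / theta0."""
    return 0.5 * (theta - theta0) ** 2 / theta0

theta0 = 2.0
for d in (0.5, 0.1, 0.01):
    exact = kl_poisson(theta0 + d, theta0)
    approx = quadratic_term(theta0 + d, theta0)
    # The ratio should move towards 1 as theta approaches theta0.
    print(d, exact, approx, exact / approx)
```

The exact value stays strictly positive for $\theta \ne \theta_0$, in line with (7.4.1), and the ratio to the quadratic term approaches one as $\theta \to \theta_0$.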



Consider the sets

$$E_n = \{x \mid \log L(x, \theta_0 - \delta) < \log L(x, \theta_0)\},$$
$$F_n = \{x \mid \log L(x, \theta_0 + \delta) < \log L(x, \theta_0)\}.$$

Then by (7.4.3), under $\theta_0$, for each $\theta \ne \theta_0$,

$$P_{\theta_0}\left[\log \frac{L(x, \theta)}{L(x, \theta_0)} < 0\right] \to 1 \text{ as } n \to \infty,$$

and thus, taking $\theta = \theta_0 - \delta$, $P(E_n) \to 1$. Similarly we have $P(F_n) \to 1$ and therefore $P(E_n \cap F_n) \to 1$
as $n \to \infty$. Thus for any $x \in E_n \cap F_n$, $\log L(x, \theta_0 - \delta) < \log L(x, \theta_0)$ and
$\log L(x, \theta_0 + \delta) < \log L(x, \theta_0)$. Since the function $\log L(x, \theta)$ is continuous
it must then first increase and then decrease as $\theta$ varies over the interval
$[\theta_0 - \delta, \theta_0 + \delta]$. Therefore, $\frac{\partial \log L}{\partial\theta}$ is initially positive and then negative.
In view of the regularity condition C-3, $\frac{\partial \log L}{\partial\theta}$ is continuous in $\theta$ and it
now follows that there exists a $\hat\theta \in [\theta_0 - \delta, \theta_0 + \delta]$ such that $\frac{\partial \log L}{\partial\theta} = 0$
at $\hat\theta$. This gives $P[|\hat\theta - \theta_0| < \delta] \ge P(E_n \cap F_n) \to 1$ as $n \to \infty$ and therefore
$\hat\theta$ is consistent. This shows that with probability approaching one as $n \to \infty$
the likelihood equation admits a solution $\hat\theta$ which is consistent. This completes
the proof of R-1.

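We next consider R-3. Expanding $\frac{1}{n}\left(\frac{\partial^2 \log L}{\partial\theta^2}\right)_{\hat\theta}$ around $\theta_0$ we have

$$\frac{1}{n}\left(\frac{\partial^2 \log L}{\partial\theta^2}\right)_{\hat\theta} = \frac{1}{n}\left(\frac{\partial^2 \log L}{\partial\theta^2}\right)_{\theta_0} + (\hat\theta - \theta_0)\,\frac{1}{n}\left(\frac{\partial^3 \log L}{\partial\theta^3}\right)_{\theta^*} \qquad (7.4.4)$$

where $\theta^* \in (\theta_0 - \delta, \theta_0 + \delta)$. Now

$$\frac{1}{n}\left(\frac{\partial^2 \log L}{\partial\theta^2}\right)_{\theta_0} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial^2 \log f(x_i, \theta)}{\partial\theta^2}\right)_{\theta_0} = \frac{1}{n}\sum_{i=1}^{n} U_i$$

where $U_i$ are i.i.d. r.v.s with $E(U_i) = -I(\theta_0)$ and by WLLN, the first term on the
RHS of (7.4.4) converges in probability to $-I(\theta_0)$. Now $(\hat\theta - \theta_0) \xrightarrow{p} 0$ and, by C-5,

$$\left|\frac{1}{n}\left(\frac{\partial^3 \log L}{\partial\theta^3}\right)_{\theta^*}\right| \le \frac{1}{n}\sum_{i=1}^{n} M(x_i).$$

Since $E(M(x_i))$ exists, by WLLN $\frac{1}{n}\sum_{i=1}^{n} M(x_i)$ converges in probability to this finite
expectation, and therefore $\frac{1}{n}\left(\frac{\partial^3 \log L}{\partial\theta^3}\right)_{\theta^*}$ is bounded in probability. Thus
the second term on the RHS of (7.4.4) converges in probability to zero and

$$\frac{1}{n}\left(\frac{\partial^2 \log L}{\partial\theta^2}\right)_{\hat\theta} \xrightarrow{p} -I(\theta_0). \qquad (7.4.5)$$

It now follows that $P\left[\left(\frac{\partial^2 \log L}{\partial\theta^2}\right)_{\hat\theta} < 0\right] \to 1$ as $n \to \infty$, as

$$\left\{x \;\Big|\; \left(\frac{\partial^2 \log L}{\partial\theta^2}\right)_{\hat\theta} < 0\right\} \supseteq (E_n \cap F_n).$$

This shows that with probability approaching one as $n \to \infty$ there is a
relative maximum of the likelihood at $\hat\theta$. This completes the proof of R-3.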
We now prove R-4 by the method of contradiction. Let $x \in E_n \cap F_n$ and
if possible let $\hat\theta_1$ and $\hat\theta_2$ be two distinct consistent roots of the likelihood
equation. Then by Rolle's theorem there must exist $\hat\theta_3 = \alpha\hat\theta_1 + (1 - \alpha)\hat\theta_2 \in N_\delta(\theta_0)$ such that

$$\frac{1}{n}\left(\frac{\partial^2 \log L}{\partial\theta^2}\right)_{\hat\theta_3} = 0.$$

However, expanding $\frac{1}{n}\left(\frac{\partial^2 \log L}{\partial\theta^2}\right)_{\hat\theta_3}$ around $\theta_0$, we show as in the proof of R-3 that

$$\frac{1}{n}\left(\frac{\partial^2 \log L}{\partial\theta^2}\right)_{\hat\theta_3} \xrightarrow{p} -I(\theta_0).$$

Thus we have a contradiction and the likelihood equation can admit two
roots in $N_\delta(\theta_0)$ only on the complement of $(E_n \cap F_n)$, the probability of which
tends to zero as $n \to \infty$. This completes the proof of R-4.


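We now consider the proof of R-2. Again we expand $\frac{1}{n}\left(\frac{\partial \log L}{\partial\theta}\right)_{\hat\theta}$ around
$\theta_0$ by Taylor series. Then for $x \in E_n \cap F_n$,

$$0 = \frac{1}{n}\left(\frac{\partial \log L}{\partial\theta}\right)_{\theta_0} + (\hat\theta - \theta_0)\,\frac{1}{n}\left(\frac{\partial^2 \log L}{\partial\theta^2}\right)_{\theta_0} + \frac{(\hat\theta - \theta_0)^2}{2}\,\frac{1}{n}\left(\frac{\partial^3 \log L}{\partial\theta^3}\right)_{\theta'}$$

where $\theta' \in N_\delta(\theta_0)$.

After some routine algebra, we obtain

$$\sqrt{n}\,(\hat\theta - \theta_0) = \frac{\frac{1}{\sqrt{n}}\left(\frac{\partial \log L}{\partial\theta}\right)_{\theta_0}}{D} \qquad (7.4.6)$$

where

$$D = -\frac{1}{n}\left(\frac{\partial^2 \log L}{\partial\theta^2}\right)_{\theta_0} - \frac{(\hat\theta - \theta_0)}{2}\,\frac{1}{n}\left(\frac{\partial^3 \log L}{\partial\theta^3}\right)_{\theta'}.$$

Now

$$\frac{1}{\sqrt{n}}\left(\frac{\partial \log L}{\partial\theta}\right)_{\theta_0} = \frac{1}{\sqrt{n}}\sum_{i=1}^{n} W_i, \quad \text{where } E(W_i) = 0 \text{ and } \mathrm{Var}(W_i) = I(\theta_0).$$

Therefore, by CLT, $\frac{1}{\sqrt{n}}\sum_{i=1}^{n} W_i \xrightarrow{d} N(0, I(\theta_0))$. Similarly, by WLLN,

$$-\frac{1}{n}\left(\frac{\partial^2 \log L}{\partial\theta^2}\right)_{\theta_0} = \frac{1}{n}\sum_{i=1}^{n}(-U_i) \xrightarrow{p} I(\theta_0)$$

as $U_i$ are i.i.d. r.v.s with $E(U_i) = -I(\theta_0)$. As shown in the proof of R-3, the second term in $D$ converges in
probability to 0 as $n \to \infty$. Therefore $D \xrightarrow{p} I(\theta_0)$. Therefore the RHS of (7.4.6)
converges in distribution to $N\left(0, \frac{1}{I(\theta_0)}\right)$. This shows that $\sqrt{n}\,(\hat\theta - \theta_0) \xrightarrow{d} N\left(0, \frac{1}{I(\theta_0)}\right)$,
which completes the proof of R-2.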


Remark 7.4.1 The general method of proof used above can be summarized
as follows. Suppose we want to prove that a certain mathematical property
Q holds under conditions, say, (A), (B) and (C). Then for $(x_1, \ldots, x_n)$ in the
support of the joint pdf of $(X_1, \ldots, X_n)$ we determine a set $H_n$ where conditions
(A), (B) and (C) hold. If $P(H_n) \to 1$ or equivalently $P(H_n^c) \to 0$ as $n \to \infty$
then we say that the property Q holds with probability approaching one as
$n \to \infty$. From the proofs of R-1 and R-3 it is clear that for $x \in E_n \cap F_n$, the
log likelihood function $\log L(x, \theta)$ is concave in $N_\delta(\theta_0)$ and therefore has a
relative maximum at some point $\hat\theta \in N_\delta(\theta_0)$. From the regularity conditions
it follows that $\left(\frac{\partial \log L}{\partial\theta}\right)_{\hat\theta} = 0$ and $\left(\frac{\partial^2 \log L}{\partial\theta^2}\right)_{\hat\theta} < 0$, and as $P(E_n \cap F_n) \to 1$
as $n \to \infty$ these properties hold with probability approaching one as $n \to \infty$.
An extension of the Cramér-Huzurbazar theorem for the m-parameter Cramér
family was first given by Chanda (1954). Under the regularity conditions
C'-1 through C'-5, it was shown that

R'-1: With probability approaching one as $n \to \infty$, the likelihood equations
$\frac{\partial \log L}{\partial\theta_r} = 0$, $r = 1, 2, \ldots, m$ admit a consistent solution $\hat\theta$.

R'-2: $\sqrt{n}\,(\hat\theta - \theta) \xrightarrow{d} N_m(0, I^{-1}(\theta))$.

R'-3: With probability approaching one as $n \to \infty$, the matrix $\left(-\frac{\partial^2 \log L}{\partial\theta_r\,\partial\theta_s}\right)$
is positive definite at $\hat\theta$.

R'-4: $\hat\theta$ is essentially unique, or the probability that the likelihood equation
admits two consistent solutions $\hat\theta_1$, $\hat\theta_2$ tends to zero as $n \to \infty$.

Although the extended Cramér-Huzurbazar theorem was extensively used,
there was a feeling in the statistics community that Chanda's proof was not
completely correct. The first specific counter example to some arguments
used in Chanda's proof was given by Tarone and Gruenhage (1975). The
problem was pinpointed by observing that Rolle's theorem is not valid in
higher dimensions. The situation was rectified by Foutz (1977) by proving the
Cramér-Huzurbazar Theorem in m dimensions, $m > 1$, by using the implicit
function theorem.

As this is a first course we will only present an outline of the proof and
for more details refer to Foutz (1977). For a thorough treatment of the
implicit function theorem we refer to Apostol (1994) and Rudin (1964).

Let $h_r = \frac{\partial \log L}{\partial\theta_r}$, $r = 1, 2, \ldots, m$ and $h = (h_1, \ldots, h_m)'$; then $h$ is a mapping from
$S^n \times \Omega$ to $R_m$. The question then is under what conditions $h(x, \theta) = 0$ has
a unique solution, for fixed $x = x_0$, given by $\hat\theta(x_0)$. The implicit function
theorem shows that under certain regularity conditions there exists an open
set $U \subset S^n$ and an open neighbourhood of $\theta_0$, $N_\delta(\theta_0)$, for which
$h(x, \theta)$ is a one to one function and $h^{-1}$ exists. Then for each $x \in U$, $h^{-1}(0)$ exists
and determines a function $\hat\theta(x)$ such that $h(x, \hat\theta(x)) = 0$. Apart from the condition
that $\frac{\partial h_r}{\partial\theta_s}$, $r = 1, 2, \ldots, m$, $s = 1, 2, \ldots, m$ exist and are continuous, the
following two conditions are required for the validity of the implicit function
theorem.

(A) For each fixed $x$, the vector $0 \in R_m$ is in the range space of $h$.

(B) The Jacobian $H = \frac{\partial(h_1, \ldots, h_m)}{\partial(\theta_1, \ldots, \theta_m)}$ is non-singular at $\theta = \theta_0$.

Now identify $h$ with the vector of score functions and note that C'-3
implies existence of continuous partial derivatives $\frac{\partial h_r}{\partial\theta_s}$. We now show that
conditions (A) and (B) hold with probability tending to one as $n \to \infty$.
Therefore the implicit function theorem holds with probability approaching
one as $n \to \infty$.

Let $\theta_0$ be the "true value" of $\theta$ and consider the random vector
$h = \left(\frac{\partial \log L}{\partial\theta_1}, \ldots, \frac{\partial \log L}{\partial\theta_m}\right)'$. Then by WLLN, for each $r = 1, 2, \ldots, m$,

$$\frac{1}{n}\left(\frac{\partial \log L}{\partial\theta_r}\right)_{\theta_0} \xrightarrow{p} 0 \text{ as } n \to \infty.$$

Therefore with probability approaching one as $n \to \infty$
the zero vector belongs to the range space of $h$ as $\theta$ varies over $N_\delta(\theta_0)$.

Similarly consider $h_{rs} = \frac{\partial^2 \log L}{\partial\theta_r\,\partial\theta_s}$; then as $-\frac{1}{n}\left(\frac{\partial^2 \log L}{\partial\theta_r\,\partial\theta_s}\right)_{\theta_0} \xrightarrow{p} I_{rs}(\theta_0)$, the
matrix $H = \left(-\frac{1}{n}\frac{\partial^2 \log L}{\partial\theta_r\,\partial\theta_s}\right)$ is positive definite and is therefore non-singular
with probability approaching one as $n \to \infty$. Therefore both conditions (A)
and (B) hold with probability approaching one as $n \to \infty$ and the implicit
function theorem holds with probability approaching one as $n \to \infty$.



We remark here that even in the exponential family, in order that $\hat\theta$ is well
defined we need the condition that for the likelihood equations, or equivalently
the moment equations based on the sufficient statistic $T$ given by

$$\frac{1}{n}\sum_{i=1}^{n} T_r(x_i) = E(T_r(X)) = \eta_r(\theta), \quad r = 1, 2, \ldots, m,$$

the observed vector $\left(\frac{1}{n}\sum_{i=1}^{n} T_1(x_i), \ldots, \frac{1}{n}\sum_{i=1}^{n} T_m(x_i)\right)'$ must belong to the range
space of $(\eta_1(\theta), \ldots, \eta_m(\theta))'$. This condition may hold only with probability
approaching one as $n \to \infty$, as was observed in Example 7.1.3 for $m = 1$. For
$m = 2$ one can refer to the discussion in Example 6.4.2. For the multinomial distribution
in k cells such a problem would arise if some of the observed cell frequencies
are zero.

7.5 Multinomial with Cell Probabilities 
Depending on a Parameter 

Now consider a multinomial distribution in k cells with cell probabilities
depending on a real or vector valued parameter $\theta$. These models arise in
genetics quite often. A classical example is given by Rao (1973) on frequencies
of blood groups O, A, B and AB. Thus we have here $k = 4$ and the cell
probabilities are given by

$$P(O) = r^2, \quad P(A) = p^2 + 2pr, \quad P(B) = q^2 + 2qr, \quad P(AB) = 2pq.$$

Here $p + q + r = 1$ so that the cell probabilities depend on a two dimensional
parameter. The above probabilities can be obtained based on the assumption of
random mating and that O is recessive to A and B and the relative frequencies
of O, A and B are $r$, $p$, $q$ respectively. Another example is where the relative
frequencies of G and g are $\theta$ and $1 - \theta$ respectively. The progeny would then
be classified as GG, Gg, gG and gg and under random mating the probabilities
of the four cells corresponding to the four genotypes are

$$P(GG) = \theta^2, \quad P(Gg) = \theta(1 - \theta) = P(gG) \quad \text{and} \quad P(gg) = (1 - \theta)^2.$$

Note that the well known Mendelian hypothesis 9 : 3 : 3 : 1 corresponds to
$\theta = 3/4$. If the physical attributes of Gg and gG are the same then the distribution
is a multinomial distribution in 3 cells with cell probabilities $\theta^2$, $2\theta(1 - \theta)$ and
$(1 - \theta)^2$ respectively.

Another context in which such a multinomial distribution arises is when
we have grouped data. For example consider an income distribution modelled
by the Pareto distribution with pdf $f(x, \lambda, \sigma) = \frac{\lambda\sigma^\lambda}{x^{\lambda+1}}$, $x \ge \sigma > 0$, $\lambda > 0$. Many
a time exact income data is not available and we get grouped data by
income being reported as belonging to one of the intervals, say, $[\sigma_0, 2\sigma_0)$,
$[2\sigma_0, 3\sigma_0), \ldots, [k\sigma_0, \infty)$ when $\sigma = \sigma_0$ is known. Otherwise the data is
reported as frequencies of the income intervals given by $[a_0, a_1), [a_1, a_2), \ldots,$
where

$$0 = a_0 < a_1 < a_2 < \ldots < a_{k-1} < a_k = \infty.$$

In the second case we have a multinomial distribution in k cells with cell
probabilities given by

$$p_1 = P(0, a_1] = \int_\sigma^{a_1} f(x, \lambda, \sigma)\, dx = 1 - (\sigma/a_1)^\lambda,$$

$$p_r = (\sigma/a_{r-1})^\lambda - (\sigma/a_r)^\lambda, \quad r = 2, \ldots, k - 1,$$

$$p_k = (\sigma/a_{k-1})^\lambda.$$

In general the multinomial distribution in k cells with cell probabilities
depending on a parameter $\theta$ has pmf

$$f(x, \theta) = \prod_{r=1}^{k} [p_r(\theta)]^{x_r} \qquad (7.5.1)$$

with $x_r = 0$ or 1 and $\sum_{r=1}^{k} x_r = 1$, and $\theta \in \Omega$ is such that $p_r(\theta) > 0$, $\sum_{r=1}^{k} p_r(\theta) = 1$.
This gives $\log f(x, \theta) = \sum_{r=1}^{k} x_r \log p_r(\theta)$.

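Suppose $\theta$ is real and $\Omega$ is an open interval. As $S$, the support of $f$, does not
depend on $\theta$, the conditions C-1 and C-2 of the previous section hold. For
C-3 we now assume that $\log p_r(\theta)$, $r = 1, 2, \ldots, k$ have derivatives up to 3rd
order. Then

$$\frac{\partial \log f(x, \theta)}{\partial\theta} = \sum_{r=1}^{k} x_r \frac{\partial \log p_r}{\partial\theta}$$

and as $E(x_r) = p_r(\theta)$,

$$E\left(\frac{\partial \log f}{\partial\theta}\right) = \sum_{r=1}^{k} p_r \frac{\partial \log p_r}{\partial\theta} = \sum_{r=1}^{k} \frac{\partial p_r}{\partial\theta}.$$

But as $\sum_{r=1}^{k} p_r(\theta) = 1$ for any $\theta \in \Omega$, $\sum_{r=1}^{k} \frac{\partial p_r}{\partial\theta} = 0$. Hence $E\left(\frac{\partial \log f}{\partial\theta}\right) = 0$. Next
consider

$$\left(\frac{\partial \log f}{\partial\theta}\right)^2 = \sum_{r=1}^{k} x_r^2 \left(\frac{\partial \log p_r}{\partial\theta}\right)^2 + \sum_{r \ne s} x_r x_s \frac{\partial \log p_r}{\partial\theta}\,\frac{\partial \log p_s}{\partial\theta}.$$

But $E(x_r^2) = p_r(\theta)$ and $x_r x_s = 0$ when $r \ne s$. Therefore

$$E\left(\frac{\partial \log f}{\partial\theta}\right)^2 = \sum_{r=1}^{k} p_r \left(\frac{\partial \log p_r}{\partial\theta}\right)^2.$$

Similarly

$$\frac{\partial^2 \log f}{\partial\theta^2} = \sum_{r=1}^{k} x_r \frac{\partial^2 \log p_r}{\partial\theta^2}, \qquad E\left[\frac{\partial^2 \log f}{\partial\theta^2}\right] = \sum_{r=1}^{k} p_r \frac{\partial^2 \log p_r}{\partial\theta^2}.$$

From $\sum_{r} p_r \frac{\partial \log p_r}{\partial\theta} = 0$ it follows, on differentiating once more, that

$$\sum_{r=1}^{k} p_r \frac{\partial^2 \log p_r}{\partial\theta^2} + \sum_{r=1}^{k} p_r \left(\frac{\partial \log p_r}{\partial\theta}\right)^2 = 0.$$

Therefore, the Fisher information is

$$I(\theta) = E\left(\frac{\partial \log f}{\partial\theta}\right)^2 = -E\left(\frac{\partial^2 \log f}{\partial\theta^2}\right).$$

Thus C-4 holds. Next consider

$$\frac{\partial^3 \log f}{\partial\theta^3} = \sum_{r=1}^{k} x_r \frac{\partial^3 \log p_r}{\partial\theta^3}.$$

Now as $x_r = 0$ or 1, we have

$$\left|\frac{\partial^3 \log f}{\partial\theta^3}\right| \le \sum_{r=1}^{k} \left|\frac{\partial^3 \log p_r}{\partial\theta^3}\right| = C(\theta).$$

In order that C-5 holds we now require that $C(\theta)$ for $\theta \in N_\delta(\theta_0)$ must be
bounded, so that the bound on $C(\theta)$ can be taken as $M(x)$. With this condition
the multinomial distribution in k cells with cell probabilities depending on
a real parameter $\theta$ will lead us to a Cramér family, the results of the
previous section hold and the MLE $\hat\theta$ can be obtained as a solution of the
likelihood equation which is essentially unique and is CAN with $AV(\hat\theta) = \frac{1}{nI(\theta)}$.
One can prove these results under much weaker assumptions and
also for a vector-valued parameter $\theta$; we refer to Rao (1973) for further details.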

Example 7.5.1 Consider the multinomial distribution in 3 cells given by $P(GG) = \theta^2$, $P(gG) = 2\theta(1 - \theta)$
and $P(gg) = (1 - \theta)^2$ when Gg and gG are not distinguishable.
Then

$$\log f(x, \theta) = x_2 \log 2 + (2x_1 + x_2) \log\theta + (2x_3 + x_2) \log(1 - \theta)$$

where $0 < \theta < 1$, $x_1 + x_2 + x_3 = 1$ and $x_i = 0$ or 1, $i = 1, 2, 3$.
Now observing that $2x_1 + 2x_2 + 2x_3 = 2$ we have

$$(2x_1 + x_2) + (2x_3 + x_2) = 2 \quad \text{or} \quad 2x_3 + x_2 = 2 - (2x_1 + x_2).$$

Therefore we have

$$\log f(x_1, x_2, \theta) = x_2 \log 2 + (2x_1 + x_2)[\log\theta - \log(1 - \theta)] + 2\log(1 - \theta)$$

which can be shown to be a one-parameter exponential family with $u(\theta) = \log\theta - \log(1 - \theta)$,
$k(x) = 2x_1 + x_2$, $v(\theta) = 2\log(1 - \theta)$ and $w(x) = x_2 \log 2$.
Since $x_i = 0$ or 1 with $x_1 + x_2 \le 1$ the support $S$ does not depend on $\theta$ and
$\Omega = (0, 1)$ is open in $R_1$. Further $u'(\theta) = \frac{1}{\theta(1 - \theta)} > 0$ over $\Omega$. Further
$(1, 2x_1 + x_2)$ are linearly independent, as can be seen by the following argument.
Let $a + b(2x_1 + x_2) = 0$ for all $(x_1, x_2) \in S$. Then taking $x_1 = 0$, $x_2 = 0$ we have
$a = 0$ and taking $x_1 = 0$ and $x_2 = 1$ we have $b = 0$. Thus the pdf belongs to the
one-parameter exponential family with $T(x) = \sum_{i=1}^{n}(2x_{1i} + x_{2i})$ as complete
sufficient statistic. The likelihood equation is the same as the moment equation given
by

$$\sum_{i=1}^{n}(2x_{1i} + x_{2i}) = 2n\theta^2 + 2n\theta(1 - \theta) = 2n\theta$$

which leads to

$$\hat\theta = \frac{1}{2n}\sum_{i=1}^{n}(2x_{1i} + x_{2i}) = \frac{2n_1 + n_2}{2n}$$

where $(n_1, n_2)$ are the observed cell frequencies. The Fisher information $I(\theta)$ can
be obtained in a straightforward way as

$$\frac{\partial \log f}{\partial\theta} = \frac{2x_1 + x_2}{\theta} - \frac{2x_3 + x_2}{1 - \theta},$$

$$\frac{\partial^2 \log f}{\partial\theta^2} = -\left[\frac{2x_1 + x_2}{\theta^2} + \frac{2x_3 + x_2}{(1 - \theta)^2}\right].$$

Taking expectations we have

$$I(\theta) = \frac{2\theta^2 + 2\theta(1 - \theta)}{\theta^2} + \frac{2(1 - \theta)^2 + 2\theta(1 - \theta)}{(1 - \theta)^2} = \frac{2}{\theta} + \frac{2}{1 - \theta} = \frac{2}{\theta(1 - \theta)}.$$

Thus $\hat\theta \sim AN\left(\theta, \frac{\theta(1 - \theta)}{2n}\right)$.

We leave it as an exercise to the reader to verify that $\mathrm{Var}(\hat\theta) = \frac{\theta(1 - \theta)}{2n}$
for each $n$, a result that can be proved by using the variance covariance structure
of $(n_1, n_2)'$ in the multinomial setup. Note that $E(\hat\theta) = \theta$ and as $\hat\theta$ is a
function of the complete sufficient statistic it is the MVUE of $\theta$, attaining the CRLB for
its variance.

Suppose gG and Gg are distinguishable; then we have a multinomial
distribution in 4 cells with cell probabilities $p_1(\theta) = \theta^2$, $p_2(\theta) = \theta(1 - \theta) = p_3(\theta)$
and $p_4(\theta) = (1 - \theta)^2$ and the data would consist of cell frequencies
$(n_1, n_2, n_3, n_4)'$ with $\sum n_i = n$. One can show that in this case

$$\log f(x, \theta) = (2x_1 + x_2 + x_3)\log\theta + (2x_4 + x_2 + x_3)\log(1 - \theta)$$
$$= k(x)\log\frac{\theta}{1 - \theta} + 2\log(1 - \theta), \quad \text{where } k(x) = 2x_1 + x_2 + x_3,$$

since $2x_1 + x_2 + x_3 = 2 - (2x_4 + x_3 + x_2)$.

Again we have a one-parameter exponential family with $u(\theta) = \log\frac{\theta}{1 - \theta}$ and
$T(x) = \sum_{i=1}^{n}(2x_{1i} + x_{2i} + x_{3i})$ as complete sufficient statistic. Routine calculations
will show that $\hat\theta = \frac{2n_1 + n_2 + n_3}{2n}$ and $I(\theta) = \frac{2}{\theta(1 - \theta)}$, so that $AV(\hat\theta) = \frac{\theta(1 - \theta)}{2n}$,
which is also the exact variance of $\hat\theta$.
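The closed-form estimate of Example 7.5.1 can be sketched in a few lines. The cell counts below are hypothetical and chosen only for illustration; the function name is likewise an assumption. The sketch computes $\hat\theta = (2n_1 + n_2)/(2n)$, plugs it into $\theta(1-\theta)/(2n)$ for an estimated standard error, and checks that the score vanishes at $\hat\theta$.

```python
import math

def three_cell_mle(n1, n2, n3):
    """MLE and estimated standard error for cells theta^2, 2 theta (1 - theta), (1 - theta)^2."""
    n = n1 + n2 + n3
    theta_hat = (2 * n1 + n2) / (2 * n)
    # Plug-in estimate of the variance theta (1 - theta) / (2 n).
    se = math.sqrt(theta_hat * (1 - theta_hat) / (2 * n))
    return theta_hat, se

def three_cell_score(theta, n1, n2, n3):
    """Derivative of the sample log likelihood with respect to theta."""
    return (2 * n1 + n2) / theta - (2 * n3 + n2) / (1 - theta)

# Hypothetical counts for (GG, Gg or gG, gg).
theta_hat, se = three_cell_mle(30, 50, 20)
print(theta_hat, se, three_cell_score(theta_hat, 30, 50, 20))
```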


Example 7.5.2 Fisher (1954) had used a multinomial distribution in 4 cells
to analyse Carver's data on two varieties of maize classified as starchy vs
sugary and further cross classified with the colour green and white. The cell
probabilities depend on a parameter $\theta \in (0, 1)$ which is known as a linkage
factor. The probabilities are given below along with the observed frequencies
as obtained in an experiment.

                     Starchy                  Sugary
                 Green      White        Green      White
Probability    (2 + θ)/4  (1 - θ)/4    (1 - θ)/4     θ/4
Frequency        1997        906          904         32

Note that the log likelihood of $(n_1, n_2, n_3, n_4)$ for a sample of size $n$ is
given by

$$\log L = \text{Constant} + n_1 \log(2 + \theta) + (n_2 + n_3)\log(1 - \theta) + n_4 \log\theta.$$

Further $\log p_i(\theta)$, $i = 1, 2, 3, 4$ are analytic functions of $\theta \in (0, 1)$ and

$$\frac{\partial \log L}{\partial\theta} = \frac{n_1}{2 + \theta} - \frac{(n_2 + n_3)}{1 - \theta} + \frac{n_4}{\theta},$$

$$\frac{\partial^2 \log L}{\partial\theta^2} = -\left[\frac{n_1}{(2 + \theta)^2} + \frac{(n_2 + n_3)}{(1 - \theta)^2} + \frac{n_4}{\theta^2}\right],$$

$$\frac{\partial^3 \log L}{\partial\theta^3} = \frac{2n_1}{(2 + \theta)^3} - \frac{2(n_2 + n_3)}{(1 - \theta)^3} + \frac{2n_4}{\theta^3}.$$

Now the likelihood equation is given by $\frac{\partial \log L}{\partial\theta} = 0$ or

$$Q(\theta) = n\theta^2 - \theta\{n_1 - 2(n_2 + n_3) - n_4\} - 2n_4 = 0$$

and routine calculation will show that the Fisher information for the sample
is given by

$$nI(\theta) = \frac{n(1 + 2\theta)}{2\theta(1 - \theta)(2 + \theta)}.$$

The family of distributions is a Cramér family but not an exponential
family. Observe that

$$\frac{\partial^3 \log f}{\partial\theta^3} = \frac{2x_1}{(2 + \theta)^3} - \frac{2(x_2 + x_3)}{(1 - \theta)^3} + \frac{2x_4}{\theta^3}$$

and for any $\theta \in (\theta_0 - \delta, \theta_0 + \delta)$

$$\left|\frac{\partial^3 \log f}{\partial\theta^3}\right| \le \frac{2x_1}{(2 + \theta_0 - \delta)^3} + \frac{2(x_2 + x_3)}{(1 - \theta_0 - \delta)^3} + \frac{2x_4}{(\theta_0 - \delta)^3} = M(x)$$

and $E_\theta(M(x)) < \infty$ and C-5 also holds.

To obtain the MLE of $\theta$, we observe that the likelihood equation is quadratic
in $\theta$. The roots of the quadratic equation are

$$\frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$$

where $a = n$, $b = -n_1 + 2(n_2 + n_3) + n_4$ and $c = -2n_4$.

Now there is a possibility of both roots being real or both complex.
However observe that $\frac{\partial^2 \log L}{\partial\theta^2} < 0$ for any set of frequencies $(n_1, n_2, n_3, n_4)'$,
i.e. $\log L$ is concave, and further $Q(0) = -2n_4 < 0$ and $Q(1) = 3(n_2 + n_3) > 0$
for most of the vectors $(n_1, n_2, n_3, n_4)'$ and therefore there is exactly one
root in $(0, 1)$, the other root being negative as the product of the two roots
is $-2n_4/n < 0$. We therefore take the positive root as the MLE. Note that this
argument fails only if $(n_1, n_2, n_3, n_4)'$ is such that $n_4 = 0$ and $n_2 = n_3 = 0$ and
therefore $n_1 = n$. Now the probability of obtaining such a sample is
$\left(\frac{2 + \theta}{4}\right)^n$ which goes to zero as $n \to \infty$. For Carver's data the MLE is
$\hat\theta = 0.035712$.
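The positive root of the quadratic likelihood equation can be checked numerically. The sketch below is illustrative rather than part of the text: the function name is hypothetical, and it uses the frequencies (1997, 906, 904, 32), which are the counts that reproduce the reported value $\hat\theta = 0.035712$.

```python
import math

def linkage_mle(n1, n2, n3, n4):
    """Positive root of n t^2 - t {n1 - 2 (n2 + n3) - n4} - 2 n4 = 0."""
    n = n1 + n2 + n3 + n4
    a = n
    b = -n1 + 2 * (n2 + n3) + n4
    c = -2 * n4
    # The discriminant b^2 + 8 n n4 is non-negative, so both roots are real here.
    disc = b * b - 4 * a * c
    return (-b + math.sqrt(disc)) / (2 * a)

theta_hat = linkage_mle(1997, 906, 904, 32)
print(round(theta_hat, 6))
```

Taking the root with the plus sign selects the one lying in $(0, 1)$, since the product of the two roots is negative.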





Exercise 7.5.1 For the classical model on the frequencies of blood groups O, A, B
and AB obtain the likelihood equations and the information matrix $I(p, q)$. For the data
O : 374, A : 436, B : 132, AB : 58, obtain the MLE $(\hat p, \hat q)'$ and an estimate of its asymptotic
variance covariance matrix (Rao, 1973).

Exercise 7.5.2 According to a genetic model sex and color blindness are associated
and the multinomial distribution has cell probabilities

                 Male                        Female
          Normal    Color blind       Normal       Color blind
           p/2         q/2          p^2/2 + pq        q^2/2

where $p + q = 1$. Obtain the MLE of $p$ and the Fisher information $I(p)$. Obtain the MLE of $q^2/2$
and its asymptotic variance. Another CAN estimator for $q^2/2$ is given by $n_4/n$. Compare
the values of the two estimators for the data $(442, 38, 514, 6)'$ and their asymptotic variances.

Exercise 7.5.6 For the Pareto distribution with parameters $(\lambda, \sigma)$ the data is grouped in
$\sigma < a_1 < a_2 < \ldots < a_{k-1} < \infty$. Suppose $\sigma$ is known, which without loss of generality can be
taken to be unity. Set up the likelihood equations for the data $(n_1, n_2, \ldots, n_k)'$. Converting to
the log scale, i.e. using $y = \log x$, we have $0 < \log a_1 < \ldots < \log a_{k-1} < \infty$ where $y$ has the
exponential distribution with pdf $g(y, \lambda) = \lambda e^{-\lambda y}$, $\lambda > 0$, $y > 0$. Set up the likelihood
equations. For a distribution grouped in 3 cells with $a_1 = 2$, $a_2 = 3$ set up the likelihood
equations in both systems and try to solve them for the observed frequencies $(60, 30, 10)'$ and
determine the MLE of $\lambda$ using the exponential model.


7.6 Solution of Likelihood Equations 

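First consider the case of a real parameter $\theta$ so that the MLE is determined
by solving the equation $\frac{\partial \log L}{\partial\theta} = 0$. We have seen that in the case of a single
parameter exponential family, the likelihood equation is equivalent to the moment
equation based on the sufficient statistic $T(x) = \frac{1}{n}\sum_{i=1}^{n} k(x_i)$ and using the results of
7.2 we have

$$T(x) = \frac{1}{n}\sum_{i=1}^{n} k(x_i) = E(k(X)) = -\frac{v'(\theta)}{u'(\theta)} = \eta(\theta). \qquad (7.6.1)$$

In some cases we have $\hat\theta = T$ itself. For example such a situation occurs
in $N(\theta, 1)$, $b(1, \theta)$, Poisson $(\theta)$ and the exponential distribution with mean $\theta$. On
the other hand in the general case, as $\frac{d\eta}{d\theta} \ne 0$ and $\frac{d\eta}{d\theta}$ is continuous, we
have that $\eta^{-1}$ is well defined and $\hat\theta = \eta^{-1}(T(x))$. For example, such a situation
occurs in the Pareto distribution with $f(x, \lambda) = \frac{\lambda}{x^{\lambda+1}}$, $x \ge 1$, $\lambda > 1$, where
$T = \frac{1}{n}\sum_{i=1}^{n} \log x_i$ and $\eta(\lambda) = \frac{1}{\lambda}$ so that $\hat\lambda = \frac{n}{\sum_{i=1}^{n} \log x_i}$. Here $\eta^{-1}$ is easy
to obtain explicitly. On the other hand there are
cases where $\eta^{-1}$ is not easy to obtain. We consider a few examples.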

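Example 7.6.1 Consider a Gamma pdf given by

$$f(x, \lambda) = \frac{1}{\Gamma(\lambda)}\, e^{-x} x^{\lambda - 1}, \quad x > 0,\ \lambda > 0.$$

Here $k(x) = \log x$ and $\eta(\lambda) = \frac{d \log \Gamma(\lambda)}{d\lambda}$. The moment equation based on
$x$ itself is $\bar x = \lambda$ and $\bar X$ is the moment estimator of $\lambda$ which is CAN with
$AV(\bar X) = \frac{\lambda}{n}$. On the other hand the likelihood equation, or the moment
equation based on $\log x$, is

$$\frac{1}{n}\sum_{i=1}^{n} \log x_i = E(\log X) = \frac{d}{d\lambda}\log\Gamma(\lambda) = \psi(\lambda). \qquad (7.6.2)$$

As $I(\lambda) = \frac{d^2}{d\lambda^2}\log\Gamma(\lambda) = \frac{d\psi}{d\lambda} > 0$, we have that $\psi$ is monotone increasing and
therefore has a unique inverse $\psi^{-1}$ and $\hat\lambda = \psi^{-1}\left(\frac{1}{n}\sum_{i=1}^{n}\log x_i\right)$. The MLE $\hat\lambda$ is
CAN with

$$AV(\hat\lambda) = \frac{1}{nI(\lambda)} = \frac{1}{n\,\frac{d^2}{d\lambda^2}\log\Gamma(\lambda)}.$$

The function $\psi(\lambda) = \frac{d}{d\lambda}\log\Gamma(\lambda)$ is known as the digamma function and
$\frac{d^2}{d\lambda^2}\log\Gamma(\lambda) = \psi'(\lambda)$ as the trigamma function, which occur in Applied
Mathematics quite often. These functions have been studied extensively, not
only for real $\lambda > 0$ but also for complex values of $\lambda$.

The function $\frac{d \log\Gamma(\lambda)}{d\lambda} = \psi(\lambda)$ has been tabulated extensively. The first
such tables were compiled by E. Pairman (1919), edited by K. Pearson.
However they define the digamma function as $\frac{d}{d\lambda}\log\Gamma(\lambda + 1)$. Note that as
$\Gamma(\lambda + 1) = \lambda\Gamma(\lambda)$ for $\lambda > 0$, we have $\frac{d}{d\lambda}\log\Gamma(\lambda + 1) = \frac{1}{\lambda} + \psi(\lambda)$ and values
of $\psi(\lambda)$ can be obtained from Pairman (1919). More recently Abramowitz
and Stegun (1964) give values of $\psi(\lambda)$ for $\lambda = 1.00\,(0.01)\,2.00$ from which
values of $\psi(\lambda)$ for $\lambda \in (0, 1)$ can be obtained by the recurrence relation
$\psi(1 + z) = \frac{1}{z} + \psi(z)$. In general $\psi(z + p) = \sum_{j=0}^{p-1}\frac{1}{z + j} + \psi(z)$ for $z \in (0, 1)$
and $p$ an integer, and thus values of $\psi(z + p)$ for any integer $p$ can be
obtained.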


To solve the equation $\frac{1}{n}\sum_{i=1}^{n}\log x_i = \psi(\lambda)$, we use the repeated bisection
method or use inverse interpolation for the observed value of $\frac{1}{n}\sum_{i=1}^{n}\log x_i$.
Since $\bar X$ is consistent and $\hat\lambda$ is also consistent it is recommended that we
first determine an interval $\bar x \pm h$ whose image under $\psi$ contains the given observed value of
$\frac{1}{n}\sum_{i=1}^{n}\log x_i$ and then use the repeated bisection method. If the digamma function is
available in tabulated form we can use methods of inverse interpolation.
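The repeated bisection recipe can be sketched as follows. This is an illustration under stated assumptions, not the text's own procedure: the digamma function is computed from the recurrence $\psi(x) = \psi(x + 1) - 1/x$ followed by a standard asymptotic series instead of tables, the sample is hypothetical, and the bracket around $\bar x$ is widened by halving or doubling until it contains the root.

```python
import math

def digamma(x):
    """psi(x) for x > 0: shift x upward by the recurrence, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x          # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # psi(x) ~ log x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def gamma_shape_mle(xs, h=0.5, tol=1e-8):
    """Solve mean(log x) = psi(lambda) by repeated bisection, starting near xbar."""
    target = sum(math.log(x) for x in xs) / len(xs)
    xbar = sum(xs) / len(xs)
    lo, hi = max(xbar - h, 1e-6), xbar + h
    while digamma(lo) > target:    # widen the bracket to the left if needed
        lo /= 2.0
    while digamma(hi) < target:    # widen the bracket to the right if needed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if digamma(mid) < target:  # psi is increasing, so the root lies to the right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical positive observations, treated as a Gamma sample with unit scale.
sample = [1.2, 0.8, 2.5, 3.1, 1.9, 0.6, 2.2, 1.4]
lam_hat = gamma_shape_mle(sample)
print(lam_hat)
```

Because $\psi$ is monotone increasing, each bisection step halves the interval that contains $\hat\lambda$.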

Example 7.6.2 Consider a truncated Poisson distribution with mean $\lambda$ but
where the zero class is missing. Here the pmf is given by

$$f(x, \lambda) = \frac{e^{-\lambda}\lambda^x}{(1 - e^{-\lambda})\, x!}, \quad x = 1, 2, \ldots,\ \lambda > 0.$$

The pmf belongs to the one-parameter exponential family with $k(x) = x$. Now

$$\frac{\partial \log f}{\partial\lambda} = \frac{x}{\lambda} - \frac{e^\lambda}{e^\lambda - 1}$$

so that $E(X) = \frac{\lambda e^\lambda}{e^\lambda - 1}$ and then the likelihood equation
is given by

$$\bar x = \frac{\lambda e^\lambda}{e^\lambda - 1} = \eta(\lambda).$$

Again the MLE $\hat\lambda$ is given by $\eta^{-1}(\bar x)$. The function $\eta^{-1}$ is not explicitly
available and we must again use numerical methods to obtain the MLE $\hat\lambda$.
We determine $\lambda_1$, $\lambda_2$ such that $\bar x \in (\eta(\lambda_1), \eta(\lambda_2))$ and then use the repeated
bisection method.
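A minimal sketch of this bisection follows. The sample mean 1.5 is a hypothetical value used only for illustration, and the bracket is an assumption of the sketch: since $\eta(\lambda) \to 1$ as $\lambda \to 0$ and $\eta(\lambda) > \lambda$, the pair $\lambda_1$ near zero and $\lambda_2 = \bar x$ always encloses the root when $\bar x > 1$.

```python
import math

def eta(lam):
    """Mean of the zero-truncated Poisson: lam e^lam / (e^lam - 1) = lam / (1 - e^(-lam))."""
    return lam / (-math.expm1(-lam))

def truncated_poisson_mle(xbar, tol=1e-12):
    """Solve xbar = eta(lam) by repeated bisection on (lam_1, lam_2) = (1e-9, xbar)."""
    if xbar <= 1.0:
        raise ValueError("the likelihood equation has no positive root unless xbar > 1")
    lo, hi = 1e-9, xbar            # eta(lo) is about 1 < xbar < eta(xbar)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if eta(mid) < xbar:        # eta is increasing, so move the lower end up
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

lam_hat = truncated_poisson_mle(1.5)   # hypothetical observed mean
print(lam_hat, eta(lam_hat))
```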

The likelihood equation in the case of a one-parameter exponential family has
a special structure, namely the variables $(x_1, \ldots, x_n)$ and $\theta$ are here separated.
On the other hand if the likelihood equation is such that the variables are separated,
i.e. it is of the type $T(x) = \eta(\theta)$, then $\frac{\partial \log L}{\partial\theta} = A(\theta)[T(x) - \eta(\theta)]$ where
$A(\theta) \ne 0$ and under the usual regularity conditions $L(x, \theta)$, the joint pdf of
$(X_1, \ldots, X_n)$, belongs to the one-parameter exponential family with $T(x)$ as complete
sufficient statistic which is MVUE for $\eta(\theta)$.

Now in the case of an m-parameter exponential family we have a similar situation.
If $T = (T_1, \ldots, T_m)'$ then the likelihood equations are $T = \eta(\theta)$, where $\eta_r(\theta_1, \ldots, \theta_m)$
is $E(T_r)$, $T_r = \frac{1}{n}\sum_{i=1}^{n} K_r(x_i)$. The MLE $\hat\theta$ is given by $\eta^{-1}(T)$. Note that
$\left(\frac{\partial\eta_r}{\partial\theta_s}\right)$ being non-singular, $\eta^{-1}$ is well defined. In the multiparameter case
the method of repeated bisection cannot be easily generalized and multivariate
inverse interpolation formulae are also not easy to apply.


In the case of a single parameter Cramér family such as Cauchy, the variables
$(x_1, \ldots, x_n)'$ and $\theta$ are not separable and the likelihood equation is given by

$$\frac{\partial \log L}{\partial\theta} = \sum_{i=1}^{n}\frac{2(x_i - \theta)}{[1 + (x_i - \theta)^2]} = 0.$$

This is an algebraic equation of degree $(2n - 1)$ in $\theta$ and an explicit solution is
not available. We then use classical iteration procedures to obtain a numerical
solution for the observed values $(x_1, \ldots, x_n)'$. In the Newton-Raphson method we
start the iterative procedure with $T_1$ as a trial value and obtain successive
iterations by

$$T_{r+1} = T_r - \left[\frac{\partial \log L / \partial\theta}{\partial^2 \log L / \partial\theta^2}\right]_{\theta = T_r}. \qquad (7.6.2)$$

Fisher (1925) proposed a modification of the Newton-Raphson method

$$T_{r+1} = T_r + \left[\frac{\partial \log L / \partial\theta}{nI(\theta)}\right]_{\theta = T_r}. \qquad (7.6.3)$$

Note that Fisher's modification of the Newton-Raphson method consists in using
$E\left(-\frac{\partial^2 \log L}{\partial\theta^2}\right) = nI(\theta)$ in the denominator, and the iterative procedure given
by (7.6.3) is also known as Fisher's method of Scoring or the Fisher-Newton-Raphson
method. The difference $T_{r+1} - T_r$ is known as the correction to $T_r$.
Since the solution $\hat\theta$ is not known, it is recommended that the iterative
procedure be stopped when $|T_{r+1} - T_r| < \delta$, where $\delta$ is a prespecified level of
accuracy.
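The two iterations can be sketched for the Cauchy location model as below. The data are hypothetical, the helper names are assumptions, and the sketch uses the fact that $I(\theta) = 1/2$ for the Cauchy location family, so the scoring step divides the score by $n/2$. Both iterations start from the sample median and stop when the correction falls below a chosen $\delta$.

```python
import statistics

def score(theta, xs):
    """d log L / d theta for the Cauchy location model."""
    return sum(2 * (x - theta) / (1 + (x - theta) ** 2) for x in xs)

def second_derivative(theta, xs):
    """d^2 log L / d theta^2 for the Cauchy location model."""
    return sum(2 * ((x - theta) ** 2 - 1) / (1 + (x - theta) ** 2) ** 2 for x in xs)

def iterate(step, start, delta=1e-10, max_iter=100):
    """Repeat T <- T + step(T) until the correction is smaller than delta."""
    t = start
    for _ in range(max_iter):
        correction = step(t)
        t += correction
        if abs(correction) < delta:
            break
    return t

# Hypothetical sample; the trial value T_1 is the sample median.
data = [-4.2, -1.3, -0.6, -0.2, 0.1, 0.4, 0.9, 1.5, 3.8]
t1 = statistics.median(data)
n = len(data)

newton = iterate(lambda t: -score(t, data) / second_derivative(t, data), t1)
scoring = iterate(lambda t: score(t, data) / (n * 0.5), t1)
print(newton, scoring)
```

The scoring step needs only the score, which is why it is simpler to apply here than the Newton-Raphson step.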

Kale (1961) showed that if $T_1$, the initial or trial solution, is chosen as a
consistent estimator then under Cramér conditions the iterative procedures
are valid in large samples. Essentially the result shows that the double limit
of $T_{nr} - \hat\theta_n$ converges in probability to zero as $r \to \infty$ and $n \to \infty$. The
method of proof is similar to the one used in the Cramér-Huzurbazar Theorem,
and the proof is obtained by showing that the conditions which imply the
convergence of the iterative procedures are satisfied with probability
approaching one as $n \to \infty$. Kale (1962) showed that similar results are valid
for the multiparameter Newton-Raphson method and the Fisher-Newton-Raphson
method, also known as the method of Scoring. The iterative procedures are defined
respectively by

$$T_{r+1} = T_r - \left[\left(\frac{\partial^2 \log L}{\partial\theta_s\,\partial\theta_t}\right)^{-1}\left(\frac{\partial \log L}{\partial\theta_1}, \ldots, \frac{\partial \log L}{\partial\theta_m}\right)'\right]_{\theta = T_r} \qquad (7.6.4)$$

$$T_{r+1} = T_r + \left[\frac{1}{n}\, I^{-1}(\theta)\left(\frac{\partial \log L}{\partial\theta_1}, \ldots, \frac{\partial \log L}{\partial\theta_m}\right)'\right]_{\theta = T_r} \qquad (7.6.5)$$

Kale (1961) considered the application of the Newton-Raphson method, the
method of scoring and some other methods to Carver's data discussed in the
previous section, starting with the trial solution given by the CAN estimator
$T_1 = \frac{1}{n}\{n_1 - n_2 - n_3 + n_4\} = 0.057046$. It is shown that for both methods
$T_5 = 0.035713$ and the correction $|T_5 - T_4| = 6 \times 10^{-6}$ for the Newton-Raphson
process and that for the method of scoring the correction is $5 \times 10^{-6}$.
For Carver's data we have $\hat\theta = 0.035712$ so the error in the 5th iteration for both
the processes is $10^{-6}$. Kale (1962) illustrates the use of the method of scoring
and the Newton-Raphson process for the two parameter Gamma distribution
using a simulated random sample of size 15 only.
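The single-parameter iterations can be tried on Carver's data with the sketch below. It is not a reproduction of Kale's computations: the frequencies (1997, 906, 904, 32) are the counts consistent with the quoted values $T_1 = 0.057046$ and $\hat\theta = 0.035712$, the function names are hypothetical, and the per-iteration values it prints should be compared with the reported ones rather than assumed to match them.

```python
def score(t, n1, n2, n3, n4):
    """d log L / d theta for cell probabilities (2+t)/4, (1-t)/4, (1-t)/4, t/4."""
    return n1 / (2 + t) - (n2 + n3) / (1 - t) + n4 / t

def second_derivative(t, n1, n2, n3, n4):
    """d^2 log L / d theta^2, which is negative for every t in (0, 1)."""
    return -(n1 / (2 + t) ** 2 + (n2 + n3) / (1 - t) ** 2 + n4 / t ** 2)

def sample_information(t, n):
    """n I(theta) = n (1 + 2t) / (2 t (1 - t) (2 + t))."""
    return n * (1 + 2 * t) / (2 * t * (1 - t) * (2 + t))

def run(step, start, delta=1e-9, max_iter=100):
    """Return the list of iterates T_1, T_2, ... until the correction is below delta."""
    path = [start]
    for _ in range(max_iter):
        correction = step(path[-1])
        path.append(path[-1] + correction)
        if abs(correction) < delta:
            break
    return path

counts = (1997, 906, 904, 32)
n = sum(counts)
t1 = (counts[0] - counts[1] - counts[2] + counts[3]) / n   # trial value T_1

newton_path = run(lambda t: -score(t, *counts) / second_derivative(t, *counts), t1)
scoring_path = run(lambda t: score(t, *counts) / sample_information(t, n), t1)
print(newton_path[:5])
print(scoring_path[:5])
```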




Exercise 7.6.1 1. Draw a random sample of size 25 from Cauchy with $\mu = 0$ from
$f(x, \mu) = \frac{1}{\pi[1 + (x - \mu)^2]}$, $x \in R_1$, $\mu \in R_1$; set up the likelihood equation and use the method
of scoring with $T_1 = x_{(13)}$, the sample median. Note that here $I(\theta) = \frac{1}{2}$ and the
method of scoring is much easier to apply than the Newton-Raphson method.

2. For Carver's data use the method of repeated bisection in the interval (0, 1) to
determine $\hat\theta$ correct to five places of decimals.

3. For Carver's data as a trial solution use

(i) $T_1' = 1 - \frac{4n_2}{n}$, (ii) $T_1'' = \frac{4n_4}{n}$

and then in each case use the method of scoring and compare the iterations required to
achieve accuracy to the fifth place of decimal for $\hat\theta$. Compare these iterations with those
given by the initial value $T_1 = \frac{1}{n}\{n_1 - (n_2 + n_3) + n_4\}$. Compare the variances of $T_1$, $T_1'$ and
$T_1''$.

4. A biologist obtained a sample of the number of eggs laid in the unopened flower heads of
the black Knapweed by the Knapweed fly. The flower heads in which no eggs were laid
were not included in the sample. A truncated Poisson distribution with the zero class missing
was used as a model since the event of a flower head without any egg could not be
ascribed to the event $X = 0$ under the complete Poisson model. Obtain the MLE of $\lambda$.

No. of eggs per flower head    1    2    3    4    5    ≥ 6
No. of flower heads           96   32    9    7    1     0

For more details refer to Cohen (1960).

5. Consider $(X_1, \ldots, X_n)$ i.i.d. $N(\mu, 1)$. Then the likelihood equation is $\frac{\partial \log L}{\partial\mu} = n(\bar x - \mu) = 0$.
Starting with the trial value $T_1 = \tilde\mu$, the sample median, show that the method of
scoring as well as the Newton-Raphson process gives $T_2 = \bar x$ for any data set $(x_1, \ldots, x_n)'$.


7.7 Asymptotically Most Efficient Estimator 

We have seen earlier that in a Cramér family, which includes the exponential
family, under certain regularity conditions the MLE $\hat\theta$ is $AN\left(\theta, \frac{1}{nI(\theta)}\right)$.
Fisher had conjectured that if $T_1 \sim AN\left(\theta, \frac{v_1(\theta)}{n}\right)$ then

$$AV(T_1) = \frac{v_1(\theta)}{n} \ge \frac{1}{nI(\theta)} \qquad (7.7.1)$$

and defined the asymptotic relative efficiency of $T_1$, $ARE(T_1, \hat\theta) = \frac{AV(\hat\theta)}{AV(T_1)} = \frac{1}{v_1(\theta)\, I(\theta)}$,
and observed that the MLE $\hat\theta$ is asymptotically most
efficient as it attains the lower bound $\frac{1}{nI(\theta)}$, which will be referred to as the
Fisher Lower Bound (FLB). A hotly debated controversy arose between K.
Pearson and Fisher on the issue of the ARE of the moment estimator and the maximum
likelihood estimator. We have already seen that in common models such
as $N(\theta, 1)$, Poisson $(\theta)$, $b(1, \theta)$ or exponential with mean $\theta$, the moment
estimator based on $\bar X$ is the same as $\hat\theta$. However in the Gamma distribution with
shape parameter $\lambda$ and pdf

$$f(x, \lambda) = \frac{1}{\Gamma(\lambda)}\, e^{-x} x^{\lambda - 1}, \quad x > 0,\ \lambda > 0$$

the moment estimator $\bar X \sim AN(\lambda, \lambda/n)$ whereas $\hat\lambda \sim AN\left(\lambda, \frac{1}{n\,\frac{d^2 \log\Gamma(\lambda)}{d\lambda^2}}\right)$.
Since explicit calculation of $\hat\lambda$ and $AV(\hat\lambda)$ for any $\lambda$ was difficult, an Indian
statistician named Koshal, working in the Indian Council of Agricultural Research
(ICAR), observed that the experimental data set indicated that the estimate
of $AV(\bar X)$ is much larger than that of $AV(\hat\lambda)$.

Today in the light of the knowledge of the trigamma function and because of the
CRLB result applied to $\bar X$, an unbiased estimator for $\lambda$, we know that

$$\mathrm{Var}(\bar X) = \frac{\lambda}{n} \ge \frac{1}{nI(\lambda)} = \frac{1}{n\,\frac{d^2 \log\Gamma(\lambda)}{d\lambda^2}}$$

and the moment estimator of $\lambda$ would be less efficient than the MLE $\hat\lambda$.
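The efficiency comparison for the Gamma shape parameter can be made concrete with a small sketch. It rests on assumptions not in the text: the trigamma function is computed from the recurrence $\psi'(x) = \psi'(x + 1) + 1/x^2$ followed by a standard asymptotic series, and the function names are hypothetical. The asymptotic relative efficiency of $\bar X$ with respect to $\hat\lambda$ is $1/\{\lambda\,\psi'(\lambda)\}$.

```python
import math

def trigamma(x):
    """psi'(x) for x > 0: shift x upward by the recurrence, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)    # psi'(x) = psi'(x + 1) + 1/x^2
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi'(x) ~ 1/x + 1/(2 x^2) + 1/(6 x^3) - 1/(30 x^5) + 1/(42 x^7)
    return result + inv + 0.5 * inv2 + inv * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42))

def are_moment_vs_mle(lam):
    """AV(MLE) / AV(sample mean) = 1 / (lam * trigamma(lam)) for the unit-scale Gamma."""
    return 1.0 / (lam * trigamma(lam))

for lam in (0.5, 1.0, 2.0, 5.0, 20.0):
    print(lam, are_moment_vs_mle(lam))
```

The ratio stays below one for every $\lambda > 0$ and increases towards one as $\lambda$ grows, so the loss from using the moment estimator is largest for small shape parameters.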

Koshal (1933, 1935) had considered fitting a Gamma density by the method
of moments and the method of maximum likelihood and had shown that the
method of maximum likelihood gives a much better fit than the method of moments.
The evidence was numerical and sample based, which Pearson refused to
accept, and he wrote a strong rebuttal and critique of Koshal in a paper in
Biometrika (1936). Fisher (1937) defended Koshal vigorously.
In the selected papers of Fisher edited by Shewhart (1950), which contains
Fisher's own comments on each paper, Fisher's comments on this paper
indicate that he found Pearson's attack on Koshal 'flagrantly unfair'.
Fisher remarks that Koshal was "right on the matter under dispute" and "in
grave danger of injury to his position and prospects through the prestige that
his attacker still enjoyed". For more details of this controversy between
Pearson and Fisher we refer to Box (1978).

We note that the Fisher Lower Bound (FLB) of the asymptotic variance of a CAN estimator of $\theta$ is the same as the CRLB to the variance of an unbiased estimator of $\theta$. Indeed Rao (1992) mentions that the CRLB was derived by him in 1943 in response to the following question asked in his class by an M.A. student, V.M. Dandekar: "Whether there is an analogous lower bound to variance of an unbiased estimator in finite samples?" Later V.M. Dandekar became an eminent economist and member of the Planning Commission of India and headed the National Sample Survey for several years. He was a distinguished intellectual of India, always raising basic fundamental questions in many fields.

As the turn of events later showed, Hodges and LeCam (1953) produced a counterexample to show that the Fisher lower bound given in (7.7.1) is not valid even for the simplest case of a sample from $\{N(\theta, 1), \theta \in R_1\}$. Recall that for a sample of size $n$ from $N(\theta, 1)$, $\bar{X} \sim N(\theta, 1/n)$ is the MLE as well as the MVUE, and its asymptotic variance and finite sample variance attain the FLB and the CRLB respectively. We now construct a class of CAN estimators of $\theta$, $T_\alpha$ for $0 < \alpha < 1$, such that

$$AV(T_\alpha) = \frac{1}{n}, \quad \forall\, \theta \ne 0$$
$$\phantom{AV(T_\alpha)} = \frac{\alpha^2}{n} \quad \text{at } \theta = 0.$$

Now for the $N(\theta, 1)$ case FLB = CRLB = $1/n$, and therefore we have a class of CAN estimators of $\theta$ for which the FLB result does not hold.

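Let $(X_1, \ldots, X_n)$ be i.i.d. $N(\theta, 1)$ and let $\{a_n\}_{n=1}^{\infty}$ be a sequence of real positive numbers such that $a_n \to 0$ and $\sqrt{n}\, a_n \to \infty$ as $n \to \infty$. For example we can take $a_n = n^{-1/4}$ etc. Let $0 < \alpha < 1$. Define

$$T_\alpha = \bar{x} \quad \text{if } |\bar{x}| > a_n$$
$$\phantom{T_\alpha} = \alpha \bar{x} \quad \text{if } |\bar{x}| \le a_n.$$

Then we show that $\sqrt{n}(T_\alpha - \theta) \xrightarrow{L} N(0, v(\theta))$ where $v(\theta) = 1$ if $\theta \ne 0$ and $v(0) = \alpha^2$ if $\theta = 0$. To prove this consider the identity

$$\sqrt{n}(T_\alpha - \theta) = \sqrt{n}(\bar{x} - \theta) + \sqrt{n}(T_\alpha - \bar{x}). \tag{7.7.2}$$

Note that (7.7.2) holds for every sample point $x$ and every $\theta \in R_1$.

Now $\sqrt{n}(\bar{x} - \theta) \sim N(0, 1)$, $\forall\, n \ge 1$, and thus $\sqrt{n}(\bar{x} - \theta) \xrightarrow{L} N(0, 1)$. Let $Y_n = \sqrt{n}(T_\alpha - \bar{x})$ and let $\theta \ne 0$. Then

$$P[\,|Y_n| < \varepsilon\,] \ge P[T_\alpha = \bar{x}] = P[\,|\bar{x}| > a_n\,] = 1 - \Phi[\sqrt{n}(a_n - \theta)] + \Phi[\sqrt{n}(-a_n - \theta)].$$

For $\theta > 0$, $\sqrt{n}(a_n - \theta) \to -\infty$ and $\sqrt{n}(-a_n - \theta) \to -\infty$. Thus $P[\,|Y_n| < \varepsilon\,] \to 1$ as $n \to \infty$, or $Y_n \xrightarrow{P} 0$. Therefore $\sqrt{n}(T_\alpha - \theta) \xrightarrow{L} N(0, 1)$ for any $\theta > 0$.

Similarly for $\theta < 0$, we can show that $\sqrt{n}(T_\alpha - \theta) \xrightarrow{L} N(0, 1)$.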

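For $\theta = 0$ consider the identity given by

$$\sqrt{n}\,(T_\alpha) = \sqrt{n}\,(\alpha \bar{x}) + \sqrt{n}\,(T_\alpha - \alpha \bar{x}). \tag{7.7.3}$$

Now $\sqrt{n}(\alpha \bar{x}) \sim N(0, \alpha^2)$ under $\theta = 0$ and thus $\sqrt{n}(\alpha \bar{x}) \xrightarrow{L} N(0, \alpha^2)$. Further if $Z_n = \sqrt{n}(T_\alpha - \alpha \bar{x})$ then

$$P[\,|Z_n| < \varepsilon\,] \ge P[T_\alpha = \alpha \bar{x}] = P[\,|\bar{x}| \le a_n\,] = \Phi[\sqrt{n}\, a_n] - \Phi[-\sqrt{n}\, a_n].$$

As $\sqrt{n}\, a_n \to \infty$, $P[\,|Z_n| < \varepsilon\,] \to 1$ as $n \to \infty$ and therefore under $\theta = 0$, $\sqrt{n}(T_\alpha - 0) \xrightarrow{L} N(0, \alpha^2)$. Therefore $T_\alpha \sim AN(\theta, v(\theta)/n)$ where $v(\theta) = 1$ if $\theta \ne 0$ and $v(\theta) = \alpha^2$ for $\theta = 0$. Thus $T_\alpha$ is CAN such that $AV(T_\alpha) \le AV(\bar{x}) = 1/n$, with strict inequality at $\theta = 0$.

This shows that the FLB is not valid and $T_\alpha$ is more efficient than $\bar{x}$.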
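The behaviour of $T_\alpha$ can be explored by simulation. The sketch below is illustrative only and rests on stated assumptions: it takes $a_n = n^{-1/4}$, draws $\bar{x}$ directly from $N(\theta, 1/n)$, and the function names `hodges_estimate` and `scaled_mse` as well as the values $n = 400$, $\alpha = 0.5$ are hypothetical choices. It estimates $n \cdot MSE(T_\alpha)$ at a few parameter points.

```python
import math
import random

def hodges_estimate(xbar, n, alpha):
    """T_alpha: keep xbar when |xbar| > a_n, otherwise shrink it to alpha * xbar."""
    a_n = n ** -0.25          # assumed threshold sequence a_n = n^(-1/4)
    return xbar if abs(xbar) > a_n else alpha * xbar

def scaled_mse(theta, n, alpha, reps=20000, seed=0):
    """Monte Carlo estimate of n * E[(T_alpha - theta)^2] under N(theta, 1) sampling."""
    rng = random.Random(seed)
    sd = 1.0 / math.sqrt(n)   # xbar ~ N(theta, 1/n)
    total = 0.0
    for _ in range(reps):
        xbar = rng.gauss(theta, sd)
        total += (hodges_estimate(xbar, n, alpha) - theta) ** 2
    return n * total / reps

n, alpha = 400, 0.5
for theta in (0.0, n ** -0.25, 1.0):
    print(f"theta = {theta:.4f}   n * MSE = {scaled_mse(theta, n, alpha):.3f}")
```

In runs of this kind the scaled MSE is close to $\alpha^2$ at $\theta = 0$ and close to 1 far from zero, while at points near the threshold $a_n$ it exceeds 1, so the gain at the origin is paid for in its neighbourhood.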

$T_\alpha$ constructed by Hodges and reported in LeCam (1953) was called 'superefficient'. The statistical community was surprised at this counterexample and reacted in two different ways. One was to restrict the class of CAN estimators so that the FLB result holds. Wolfowitz (1965), Rao [(1962), (1963)] followed this line of attack. The other approach was to show that the problem of 'superefficiency', or improvement of $\bar{x}$ using $T_\alpha$ in the above way, can be done at only a single point or a finite number of points. LeCam (1953) showed that such an improvement of $\bar{x}$ in the $N(\theta, 1)$ case cannot be obtained for $\theta \in (a, b)$, an interval of $R_1$. Later Ibragimov and Has'minski (1981) showed that for $a_n = n^{-1/4}$ and for the sequence of parameter points $\theta_n = c/\sqrt{n}$, we have

$$\lim_{n \to \infty} n\{E(\bar{x} - \theta_n)^2 \mid \theta_n\} = 1$$

$$\lim_{n \to \infty} n\{E(T_\alpha - \theta_n)^2 \mid \theta_n\} > 1.$$

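Kale (1985) obtained the exact distribution of $T_\alpha$. Using the fact that $\bar{x} \sim N(\theta, 1/n)$ for each $n$, one can easily show that $G_n(t, \theta) = P[T_\alpha \le t \mid \theta]$, the d.f. of $T_\alpha$, is given by

$$G_n(t, \theta) = \Phi[\sqrt{n}(t - \theta)], \quad t \le -a_n$$
$$= \Phi[\sqrt{n}(-a_n - \theta)], \quad -a_n < t \le -\alpha a_n$$
$$= \Phi[\sqrt{n}(t/\alpha - \theta)], \quad -\alpha a_n < t \le \alpha a_n \tag{7.7.3}$$
$$= \Phi[\sqrt{n}(a_n - \theta)], \quad \alpha a_n < t \le a_n$$
$$= \Phi[\sqrt{n}(t - \theta)], \quad a_n < t.$$

$G_n(t, \theta)$ is everywhere continuous and is differentiable at all $t$ except $t = -a_n, -\alpha a_n, \alpha a_n, a_n$. The pdf of $T_\alpha$ is given by

$$g_n(t, \theta) = \sqrt{n}\, \varphi[\sqrt{n}(t - \theta)], \quad t \notin [-a_n, a_n]$$
$$= \frac{\sqrt{n}}{\alpha}\, \varphi\left[\sqrt{n}\left(\frac{t}{\alpha} - \theta\right)\right], \quad t \in (-\alpha a_n, \alpha a_n)$$
$$= 0, \quad t \in (\alpha a_n, a_n) \cup (-a_n, -\alpha a_n).$$

Observe that the probability of any subset of $(\alpha a_n, a_n) \cup (-a_n, -\alpha a_n)$ is zero under any $\theta$. It now follows that

$$\frac{\partial^2 \log g_n(t, \theta)}{\partial \theta^2} = -n \quad \text{if } t \notin (-a_n, a_n) \text{ or } t \in (-\alpha a_n, \alpha a_n)$$
$$= 0 \quad \text{otherwise}$$

and that the Fisher information $I_{T_\alpha}(\theta) = n$, so that $T_\alpha$ preserves all the information in the sample. Thus $T_\alpha$ is sufficient. Indeed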


$$\frac{dT_\alpha}{d\bar{x}} = 1 \quad \text{if } \bar{x} \notin [-a_n, a_n]$$
$$= \alpha \quad \text{if } \bar{x} \in (-a_n, a_n)$$

and $\frac{dT_\alpha}{d\bar{x}}$ does not exist at $\bar{x} = a_n$ or $-a_n$. Since $P\left[\frac{dT_\alpha}{d\bar{x}} \ne 0\right] = 1$ under each $\theta$, $T_\alpha$ to $\bar{x}$ is a one-one transformation and $T_\alpha$ is minimal sufficient and complete. Thus the superefficiency of $T_\alpha$ based on asymptotic variance is somewhat illusory.

We can show that the reduction in MSE by the use of $T_\alpha$ is obtained only at points in the neighbourhood of $\theta = 0$. The anomalous behaviour of $T_\alpha$ compared to $\bar{x}$ may be attributed to the fact that whereas the convergence in distribution of $\sqrt{n}(\bar{x} - \theta)$ to $N(0, 1)$ is uniform in $\theta$, the convergence in distribution of $\sqrt{n}(T_\alpha - \theta)$ to $N(0, 1)$ is not uniform in $\theta$.

Besides the method of maximum likelihood and the moment and percentile methods, there are several other methods to generate CAN estimators. For discrete data this includes minimum chi-square, where we minimize

$$\text{Pearson } \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i(\theta))^2}{E_i(\theta)}$$

where $O_i$ = observed cell frequency of the $i$th cell and $E_i(\theta)$ = expected frequency in the $i$th cell when $\theta$ is the true value. The minimization of Neyman's modification of $\chi^2$, namely

$$\sum_{i=1}^{k} \frac{(O_i - E_i(\theta))^2}{O_i},$$

also leads to a CAN estimator of $\theta$ under suitable regularity conditions. We also mention L, R and M estimators and for the details refer to Serfling (1980). However, we will not go into details here.
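As a small illustration of the two minimum chi-square criteria above, the following sketch minimizes both over a grid. Everything specific in it is an assumption made for the example and does not come from the text: the three-cell model with probabilities $\theta^2$, $2\theta(1-\theta)$, $(1-\theta)^2$, the observed counts (30, 50, 20), and the helper names.

```python
def expected(theta, n):
    """Expected cell frequencies under the assumed three-cell model."""
    return [n * theta ** 2, n * 2 * theta * (1 - theta), n * (1 - theta) ** 2]

def pearson(theta, obs):
    """Pearson chi-square: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, expected(theta, sum(obs))))

def neyman(theta, obs):
    """Neyman's modification: sum of (O - E)^2 / O (needs every O > 0)."""
    return sum((o - e) ** 2 / o for o, e in zip(obs, expected(theta, sum(obs))))

def argmin_on_grid(criterion, obs, steps=9999):
    """Crude grid search over the open interval (0, 1)."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return min(grid, key=lambda t: criterion(t, obs))

obs = (30, 50, 20)                                       # hypothetical counts
mle = (2 * obs[0] + obs[1]) / (2 * sum(obs))             # closed-form MLE for this model
print("MLE               :", mle)
print("min Pearson chi-sq:", argmin_on_grid(pearson, obs))
print("min Neyman chi-sq :", argmin_on_grid(neyman, obs))
```

For data that fit the model well, the three estimates lie close together, which is what one would expect from estimators sharing the same asymptotic distribution.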

The method of maximum likelihood is perhaps the most widely used method to generate CAN estimators when the model is fully specified except for a finite dimensional labelling parameter $\theta$. Even here there are situations such as the Neyman-Scott problem, truncated parameter space and/or truncated sample space where the method leads to estimators which are not CAN and do not perform well. We refer to Rao (1962) for such situations. It is emphasized that the best course of action is to obtain the MLE $\hat\theta$ and study its asymptotic distribution by various methods including those based on computer simulations as well as bootstrap/subsampling etc.



Tests of Hypotheses-I 


8.1 Historical Perspective 

In several situations, rather than estimating an indexing parameter $\theta$ in the model $\{f(x, \theta), \theta \in \Omega\}$, we are interested in checking whether $\theta$ has a specified value. For example if we observe the sex of a newborn child such that $\theta$ denotes the probability of the event that a male child is born, then we may be interested in testing whether $\theta = 1/2$, i.e. parity between the two sexes, which has in fact been a problem of interest since the 18th century. For example Laplace studied the ratio of births of boys to that of girls observed in different cities in Europe and obtained an estimate of $\hat\theta = \frac{22}{43} = 0.5116279$. The observed difference from the hypothetical value $\theta = 1/2$ is 0.0116279. Does this deviation from the parity value $\theta = 1/2$ provide enough evidence against the hypothesis of parity? For more details we refer to E.S. Pearson (1978).

Similarly, Arbuthnot (1710) (Kendall and Plackett, 1977) observed that in the city of London during the period 1629 to 1710, each year the number of male births exceeded the number of female births. Implicitly using a model of a Bernoulli series of trials, Arbuthnot calculated the probability of such an event as $(1/2)^{82}$ and using logarithms evaluated the same as $\frac{1}{4836} \times 10^{-21} = (2.0678246)10^{-25}$. Arbuthnot argued that this probability is extremely small and therefore the regularity with which the number of male births exceeds the number of female births may be due to 'Divine Providence'. On the other hand Laplace argued that the observed deviation is significant in that the probability of such a deviation from the hypothetical value is very small and the deviation may be due to 'regular causes'. In another context Laplace studied further the phenomenon of "excess of births of boys over those of girls" and claimed a specific cause, namely the practice of abandoning newly born girls to Foundling Hospitals in many cities in Europe.
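Arbuthnot's figure is easy to check with exact arithmetic. This is a minimal sketch; the variable names are illustrative, and the rounded form $\frac{1}{4836} \times 10^{-21}$ is the one quoted above.

```python
from fractions import Fraction

years = 1710 - 1629 + 1              # both end years counted: 82 years
p_exact = Fraction(1, 2) ** years    # probability of 82 "male" years under parity
p_float = float(p_exact)
rounded_form = (1 / 4836) * 1e-21    # the logarithm-based value quoted in the text

print("years         :", years)
print("(1/2)^82      :", p_float)
print("1/4836 x 1e-21:", rounded_form)
print("relative gap  :", abs(p_float - rounded_form) / p_float)
```

The two values agree to about four significant figures, which is what a hand computation with logarithms would be expected to give.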

The above thinking was first used by K. Pearson in proposing the well-known chi-squared test of goodness of fit, and later by Fisher in proposing various tests of significance. The logic here according to Fisher (1959) was that of a simple disjunction, namely "Either an exceptionally rare chance has occurred or else the null hypothesis is not true."

The question arises how to define "an exceptionally rare chance" or an event of small probability. One could do it in a normative way, i.e. by prescribing a level such as 5%, 1% etc., and then determine a cutoff point in the tails of the distribution of a statistic $T$ which is used as a test statistic. For example Laplace used $\hat\theta$, the observed proportion of male births, and Arbuthnot $T$ = number of years in which the number of male births is larger than female births. Under the hypothesis of parity, i.e. $\theta = 1/2$, Laplace's model uses $T = n\hat\theta \sim B(n, 1/2)$ and Arbuthnot's $T \sim B(82, 1/2)$, and the tails of the distribution of the test statistics $n\hat\theta$ and $T$ can be defined in a natural way. If we plot the pdf of the test statistic, tail areas get defined in a natural way. On the other hand, why one should restrict to tail areas only is not very clear. If an exceptionally rare chance event is defined normatively by a 5% or 1% level, then any event with probability less than the 5% or 1% level would be eligible to become 'an exceptionally rare chance event'.

For example in the Bernoulli series model used by Arbuthnot every sequence of 82 trials has the same probability $1/2^{82}$, but one would not suspect a sequence in which 41 times male births exceeded female births whereas in the remaining 41 years female births exceeded male births. The total number of such sequences is $\binom{82}{41}$. Among such sequences should one suspect the sequence in which the difference between male and female births is alternately positive and negative? In a similar manner suppose that in the first 41 years male births exceeded female births and in the next 41 years the reverse situation is observed; then should this be regarded as a rare chance event? Similarly in a coin tossing situation a captain that wins all the tosses in a series of games is regarded as lucky, but as per the Bernoulli series model every sequence of outcomes has the same probability, say $1/2^n$, where $n$ is the total number of games. Savage (1976) reports about an informal discussion which he had with Fisher on this matter. Fisher regarded the issue as an academic one and felt that which events are to be regarded as rare events is clear from the scientific context on which the model is based. Thus the values in the tail areas of the distribution of the test statistic are regarded as significant and not those in the central part of the distribution. In the problem of testing goodness of fit, the large values of $\chi^2$ belonging to the right tail are regarded as significant, not the small values of $\chi^2$, although one could argue that "the fit is too good to be true".

This being a first course in parametric inference, we assume that the given model $\{L(x, \theta), \theta \in \Omega\}$ is true and consider the problem of testing of hypotheses in this restricted context only. Then for an event $E$ we can compare $P_\theta(E)$ under different values of $\theta$ and define $E$ to be an event of small probability in terms of $P_\theta(E)$, for fixed $E$ and variations in $\theta$. This naturally leads to comparison of the likelihood $L(x, \theta)$ for fixed data $x$ and variations in $\theta$. This approach will be considered. Typically then the problem is posed as follows.

We have a sample of size $n$ from $\{f(x, \theta), \theta \in \Omega\}$ and want to test whether the data $x$ supports the hypothesis $\theta \in \Omega_1$, a subset of $\Omega$, or $\theta \in \Omega_2 = \Omega - \Omega_1$, the complement of $\Omega_1$ relative to $\Omega$. Note that we could view the above problem as that of estimating $\psi(\theta)$ by taking $\psi(\theta)$ to be the indicator function of $\Omega_1$, i.e. $\psi(\theta) = 1$ if $\theta \in \Omega_1$ and zero otherwise. However the theory of MVU estimation does not apply in most cases of interest as $U_\psi$ is usually empty if $\Omega_1$ is large, as the following example shows.

Example 8.1.1. Consider a problem commonly occurring in SQC. Out of a large batch of production a sample of size 20 is drawn and each item is inspected and classified as defective or nondefective. We assume that the model is such that $(X_1, \ldots, X_{20})$ are i.i.d. $b(1, \theta)$ where $\theta = P[X_i = 1] = P[i\text{-th item is defective}]$. Suppose $\Omega_1 = (0, \theta_0]$ where $\theta_0$ is a specified tolerance level of percentage defective such as 10%. Then $\Omega_2 = (\theta_0, 1)$. Now $\bar{X}_{20}$ is MVUE and CAN for $\theta$, and it is also a minimal complete sufficient statistic for the subfamilies $\{L(x, \theta), \theta \in \Omega_1\}$ and $\{L(x, \theta), \theta \in \Omega_2\}$. If possible let $\varphi(\bar{x}_{20})$ be unbiased for $\psi(\theta) = 1$ if $\theta \le \theta_0$ and zero otherwise. Then using completeness of the subfamilies we must have $\varphi(\bar{x}_{20}) = 1$ as well as $\varphi(\bar{x}_{20}) = 0$ for all values of $\bar{x}_{20}$, which is a contradiction. This shows that $U_\psi$ is empty and the MVU estimation approach is not possible. Further $\psi(\theta)$ has zero derivative at all points $\theta \ne \theta_0$ and at $\theta_0$, $\psi(\theta)$ is not differentiable. Therefore we cannot obtain a CAN estimator of $\psi(\theta)$ based on $\bar{X}_n$ using the techniques given in earlier chapters, although $\bar{X}_n$ is itself CAN for $\theta$.

On the other hand if $\Omega$ is too small, say consisting of two competing models given by distribution functions $F_1(x)$ and $F_2(x)$ only, then $U_\psi$ is too large and then also the MVUE approach fails, as the following example shows.

Example 8.1.2 (Rao, 1973). Suppose we have a sample of size one on $f(x, \theta) = \theta x^{\theta - 1}$, $0 < x < 1$, $\theta > 0$ and $\Omega = \{1/2, 1\}$, i.e. it consists of two points only, namely $\theta = 1/2$ and $\theta = 1$. Thus we have two competing models given by $f(x, 1)$ and $f(x, 1/2)$. Let $\Omega_1 = \{1/2\}$ and $\Omega_2 = \{1\}$. Consider $\varphi_m(x) = a x^m + b$; then

$$E[\varphi_m(x) \mid \theta = 1] = \frac{a}{m + 1} + b \quad \text{if } m > -1$$

and

$$E[\varphi_m(x) \mid \theta = 1/2] = \frac{a}{2m + 1} + b \quad \text{if } m > -1/2.$$

Therefore for $m > -1/2$, $\varphi_m(x)$ is unbiased for $\psi(\theta)$, the indicator function of $\Omega_1$, if $\frac{a}{m+1} + b = 0$ and $\frac{a}{2m+1} + b = 1$. Solving these two equations for $a$ and $b$ in terms of $m$, we have for $m \ne -1$, $m \ne -1/2$ and $m \ne 0$ that

$$\varphi_m(x) = -\frac{(m + 1)(2m + 1)}{m}\, x^m + \frac{2m + 1}{m} \tag{8.1.1}$$

is unbiased for $\psi(\theta)$.

Now under $\theta = 1$, we have

$$Var[\varphi_m(x)] = E[\varphi_m^2(x)] = \frac{(2m + 1)^2}{m^2} \int_0^1 \left[1 - 2(m + 1)x^m + (m + 1)^2 x^{2m}\right] dx.$$

As $m > -1/2$ the integral above exists and

$$Var[\varphi_m(x)] = (2m + 1). \tag{8.1.2}$$

Now consider $m \to -1/2$; then (8.1.2) shows that $\inf Var[\varphi_m(x)] = 0$ and thus zero is the lower bound for any $T \in U_\psi$. However we cannot have any $T \in U_\psi$ which attains the lower bound zero, as $E(T \mid \theta = 1) = 0$ and $E[T \mid \theta = 1/2] = 1$. Therefore there does not exist any MVUE of $\psi(\theta)$ although $U_\psi$ is not empty.
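The algebra of Example 8.1.2 can be verified with exact rational arithmetic. The sketch below is only a check under the stated model: it uses $E[X^p \mid \theta] = \theta/(p + \theta)$ for $f(x, \theta) = \theta x^{\theta - 1}$ on $(0, 1)$, and the function names are hypothetical.

```python
from fractions import Fraction

def moment(p, theta):
    """E[X^p] under f(x, theta) = theta * x^(theta - 1) on (0, 1); needs p + theta > 0."""
    return theta / (p + theta)

def coefficients(m):
    """The constants a and b of (8.1.1) for phi_m(x) = a * x^m + b."""
    a = -(m + 1) * (2 * m + 1) / m
    b = (2 * m + 1) / m
    return a, b

def mean_phi(m, theta):
    """E[phi_m(X)] at the given theta."""
    a, b = coefficients(m)
    return a * moment(m, theta) + b

def var_phi_under_theta_one(m):
    """Var[phi_m(X)] at theta = 1, from the first two moments."""
    a, b = coefficients(m)
    one = Fraction(1)
    second = a * a * moment(2 * m, one) + 2 * a * b * moment(m, one) + b * b
    return second - mean_phi(m, one) ** 2

for m in (Fraction(1), Fraction(1, 2), Fraction(-1, 4), Fraction(-49, 100)):
    print(m, mean_phi(m, Fraction(1)), mean_phi(m, Fraction(1, 2)), var_phi_under_theta_one(m))
```

Each row shows mean 0 at $\theta = 1$, mean 1 at $\theta = 1/2$ and variance $2m + 1$, which shrinks towards zero as $m$ approaches $-1/2$ without ever reaching it.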

In the parametric set up that we have assumed in this course it is more natural to pose the problem as that of testing hypotheses $H_1 : \theta \in \Omega_1$ against $H_2 : \theta \in \Omega_2 = \Omega - \Omega_1$ on the basis of $X = x$, assuming that the distribution of $X$ is specified by the model $\{L(x, \theta); \theta \in \Omega\}$. In this formulation the important characteristic of the procedure is the two types of possible errors, namely (i) false rejection of $H_1$, equivalent to false acceptance of $H_2$, and (ii) false acceptance of $H_1$, equivalent to false rejection of $H_2$. Thus the emphasis would not be on defining events of small probability but on studying the probabilities of the two types of error.

In the next section we will illustrate this approach and consider the problem from the above view point.

8.2. Critical Regions and Test Functions 

Let $(X_1, X_2, \ldots, X_n)$ be a random sample of size $n$ from $\{f(x, \theta), \theta \in \Omega\}$. On the basis of the data $x = (x_1, \ldots, x_n)$ we have to decide whether $H_1 : \theta \in \Omega_1$ holds or $H_2 : \theta \in \Omega_2 = \Omega_1^c$ holds. One of the simplest ways to define such a procedure is to divide the sample space into two mutually exclusive sets, say $E_1$ and $E_2 = E_1^c$, such that if $x \in E_1$ then we accept $H_1$ (or equivalently reject $H_2$) and if $x \in E_2 = E_1^c$ then we accept $H_2$ (reject $H_1$). Then the two error probabilities are

(i) $P_\theta(E_1)$ for $\theta \in \Omega_2$ : false acceptance of $H_1$

(ii) $P_\theta(E_2)$ for $\theta \in \Omega_1$ : false rejection of $H_1$.

Note that $P_\theta(E_2) = 1 - P_\theta(E_1)$ as $(E_1, E_2)$ is a disjoint partition of the sample space. Thus it is enough to study either $P_\theta(E_1)$ or $P_\theta(E_2)$. Note that $P_\theta(E_2)$ is the probability of rejecting $H_1$ when $\theta$ is the true value of the parameter. Thus we define a subset $C_{H_1}$ of the sample space as a critical region for $H_1$, i.e. if $x \in C_{H_1}$ then we reject $H_1$, otherwise we accept it. Then the test procedure is defined by the critical region $C_{H_1}$ or equivalently its indicator function $\varphi(x) = 1$ if $x \in C_{H_1}$ and zero otherwise. Now $P_\theta(C_{H_1}) = E_\theta(\varphi(x))$ and the function $P_\theta(C_{H_1})$ for $\theta \in \Omega_1$ represents the error of false rejection of $H_1$ and $1 - P_\theta(C_{H_1})$ for $\theta \in \Omega_2$ represents the error of false acceptance of $H_1$, since when $x \notin C_{H_1}$ we accept $H_1$. One could also formulate the problem in terms of a critical region $C_{H_2}$ which rejects $H_2$ when $x \in C_{H_2}$. However $C_{H_2} = C_{H_1}^c$ and thus it is adequate to formulate the problem in terms of either $C_{H_1}$ or $C_{H_2}$. So far there is a perfect duality between $H_1$, $H_2$ and $C_{H_1}$, $C_{H_2}$. We will therefore formulate the problem in terms of $H_1$ and $C_{H_1}$ only. We first consider an example.

Example 8.2.1. Consider the situation occurring commonly in SQC as given in Example 8.1.1. Let $H_1$ be the hypothesis that $\theta \in \Omega_1 = (0, .1]$ and $H_2$ be the hypothesis that $\theta \in \Omega_2 = (.1, 1)$. A critical region for the hypothesis $H_1$ (or equivalently an acceptance region for $H_2$) can be defined in terms of $T = \sum_{i=1}^{20} X_i$. For example let $C_3 = \{3, 4, \ldots, 20\}$. Thus if we observe $T$ in $C_3$ we reject $H_1$ and say that the batch is not of good quality, i.e. $\theta > .1$. This would imply rejecting the batch and returning the same to the manufacturer.

Now the critical region for $H_1$ given by $C_3$ can be described as $[T \ge 3]$. Error probabilities are given by $P_\theta(C_3)$ for $\theta \in \Omega_1 = (0, .1]$ and $1 - P_\theta(C_3)$ for $\theta \in \Omega_2 = (.1, 1)$. Consider now some other possible critical regions: $C_4 = [T \ge 4]$, so that $C_3 = [T \ge 3] = C_4 \cup [T = 3]$, or $C_2 = [T \ge 2]$, i.e. $C_3 \cup [T = 2] = C_2$. Thus $C_4 \subset C_3 \subset C_2$. Then the following table, which can be obtained from the tables of Binomial Distributions, gives error probabilities of false rejection and false acceptance for values of $\theta$ = .01, .05 and .10 and $\theta$ = .20, .30 and .40.

                   Error probabilities of false       Error probabilities of false
                   rejection of H_1                   acceptance of H_1
Critical region    θ = .01    θ = .05    θ = .10      θ = .20    θ = .30    θ = .40
C_4                .000       .016       .133         .411       .107       .016
C_3                .001       .075       .323         .206       .035       .004
C_2                .017       .264       .608         .069       .008       .001

Note that $P_\theta(C)$ for $\theta \le .1$ is an error of false rejection and $1 - P_\theta(C)$ for $\theta > .1$ is an error of false acceptance. If we compare $C_3$ and $C_2$ then the probability of false rejection is smaller for $C_3$ than that for $C_2$, but the probability of error of the other type, namely the probability of false acceptance of $H_1$, is higher. On the other hand if we compare $C_3$ and $C_4$ then the probability of false rejection of $H_1$ is smaller for $C_4$ than that for $C_3$ but the probability of false acceptance of $H_1$ behaves in the opposite manner. The reader will recognize this as a general phenomenon. Let $C$ and $D$ be two critical regions such that $C \subset D$. Then for $\theta \in \Omega_1$, $P_\theta(C) \le P_\theta(D)$ and for $\theta \in \Omega_2$, $1 - P_\theta(C) \ge 1 - P_\theta(D)$, i.e. the probabilities of errors of the two types behave in such a way that if one decreases then the other increases. Thus by enlarging or shrinking a given critical region we can decrease one type of error probability at the cost of an increase in the error probability of the other type.
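The error probabilities for the critical regions $[T \ge 2]$, $[T \ge 3]$ and $[T \ge 4]$ of Example 8.2.1 can be recomputed directly from the binomial distribution instead of being read from tables. A minimal sketch follows; the function names are hypothetical and the boundary .1 between $\Omega_1$ and $\Omega_2$ is taken from the example.

```python
from math import comb

def upper_tail(c, n, theta):
    """P[T >= c] for T ~ Binomial(n, theta)."""
    return sum(comb(n, t) * theta ** t * (1 - theta) ** (n - t) for t in range(c, n + 1))

def error_probability(c, n, theta, boundary=0.1):
    """False rejection of H_1 if theta <= boundary, false acceptance of H_1 otherwise."""
    reject = upper_tail(c, n, theta)
    return reject if theta <= boundary else 1.0 - reject

print("theta   C_4     C_3     C_2")
for theta in (0.01, 0.05, 0.10, 0.20, 0.30, 0.40):
    row = [error_probability(c, 20, theta) for c in (4, 3, 2)]
    print(f"{theta:.2f}   " + "   ".join(f"{p:.3f}" for p in row))
```

Reading across a row shows the trade-off described above: enlarging the critical region from $C_4$ to $C_2$ raises the false rejection probabilities and lowers the false acceptance probabilities.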

Consider now the general problem where we have a random sample of size $n$ from $\{f(x, \theta), \theta \in \Omega\}$ and we want to test $H_1 : \theta \in \Omega_1$ against $H_2 : \theta \in \Omega_2$. Let $A$ be a critical region for $H_1$ and $A^c$ its acceptance region. Define $\beta(A, \theta) = P(x \in A \mid \theta)$; then $\beta(A, \theta)$ for $\theta \in \Omega_1$ is the error of false rejection of $H_1$ and $1 - \beta(A, \theta)$ for $\theta \in \Omega_2$ the error of false acceptance of $H_1$. An ideal test procedure would be one for which $\beta(A, \theta)$, $\theta \in \Omega_1$ and $1 - \beta(A, \theta)$, $\theta \in \Omega_2$ are minimized simultaneously by selecting $A$ in a proper way. However, even in the simplest problem where $\Omega_1 = \{\theta_1\}$ and $\Omega_2 = \{\theta_2\}$, we cannot select $A$ such that both $\beta(A, \theta_1)$ and $1 - \beta(A, \theta_2)$ are minimized simultaneously. This follows from the fact that $\beta(A, \theta_1)$ can be reduced only by removing sample points from $A$, i.e. shrinking $A$, but this results in reducing $\beta(A, \theta_2)$, which increases $1 - \beta(A, \theta_2)$. On the other hand $1 - \beta(A, \theta_2)$ can be reduced only by increasing $\beta(A, \theta_2)$, i.e. by enlarging the set $A$, which increases $\beta(A, \theta_1)$.

To get out of this dilemma we weigh the importance of the two types of errors based on other considerations, and try to control that error which is more important and then, subject to this constraint, minimize the error of the other type. Thus if we regard false rejection of $H_1$ as a more serious error than false acceptance of $H_1$, then we control $\beta(A, \theta)$ for $\theta \in \Omega_1$ by specifying a level $\alpha$, i.e. $\beta(A, \theta) \le \alpha$, $\forall\, \theta \in \Omega_1$, and subject to this condition minimize $1 - \beta(A, \theta)$ for each $\theta \in \Omega_2$ by varying $A$. On the other hand if we consider that the error of false acceptance of $H_1$ is more serious, then we must have $1 - \beta(A, \theta) \le \alpha$, $\forall\, \theta \in \Omega_2$, and subject to this constraint minimize $\beta(A, \theta)$ for each $\theta \in \Omega_1$ by varying $A$. This is the classical formulation of the problem due to Neyman and Pearson (1933). Their fundamental contribution was to show that there is a paradigm by which the best critical region can be determined such that one of the errors is controlled at a preassigned level $\alpha$ and subject to this restriction the other error is minimized.

However in such an approach the perfect duality between $H_1$ and $H_2$ is lost. To emphasize this asymmetry between $H_1$ and $H_2$ that we have introduced, we denote by $H_0$ that hypothesis among $H_1$ and $H_2$ the false rejection of which is regarded as the more serious error and call it the null hypothesis. The other hypothesis will be called the alternative hypothesis and will be denoted by $H_A$.

The problem of choosing $H_0$, the null hypothesis, out of $H_1$ and $H_2$ may involve a variety of considerations. For example consider the illustrative Example 8.2.1 where

$H_1$ : Batch is of good quality, which is equivalent to $\theta \le .1$

$H_2$ : Batch is not of good quality, which is equivalent to $\theta > .1$

From the consumer's view point rejecting $H_1$ when $H_1$ is true, i.e. rejecting a batch when in fact it is of good quality, is a less serious error as compared to accepting $H_1$ (rejecting $H_2$) when $H_1$ is false ($H_2$ is true). Thus here $H_0$ should be taken as $H_2$. From the manufacturer's view point rejecting a batch when in fact it is of good quality may result in a considerable financial loss. On the other hand rejecting $H_2$ when $H_2$ is true, i.e. marketing a batch of bad quality, may result in higher profits. Thus from the view point of the manufacturer it is tempting to select $H_0$ as $H_1$. However marketing bad batches too often may lead to loss of reputation and eventually reduce his share of the market. Thus if the manufacturer has short term interests, such as making quick money and then leaving the market, he may select $H_1$ as $H_0$. But if he has a long term interest then he would take $H_2$ as $H_0$.

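Suppose that the choice of $H_0$ and $H_A$ has been made. Then on the basis of data $X = x$ arising out of the model $\{L(x, \theta), \theta \in \Omega\}$ we test $H_0 : \theta \in \Omega_0$ against $H_A : \theta \in \Omega_A$. If we have a critical region $C$ for $H_0$, then we reject $H_0$ if $x \in C$, and $P_\theta(C)$ for $\theta \in \Omega_0$ is the error probability of false rejection and is called the Type I error. The error probability of false acceptance is called the Type II error. Let

$$D_\alpha = \{C \mid P_\theta(C) \le \alpha,\ \forall\, \theta \in \Omega_0\}. \tag{8.2.1}$$

Among $D_\alpha$ we want to select $C^*$ such that $1 - P_\theta(C^*)$ is minimum for all $\theta \in \Omega_A$, which is the same as determining $C^*$ such that

$$P_\theta(C^*) \ge P_\theta(C), \quad \forall\, C \in D_\alpha \text{ and } \forall\, \theta \in \Omega_A. \tag{8.2.2}$$

The prescribed number $\alpha$ is called the level of significance of the test procedure and $\sup_{\theta \in \Omega_0} P_\theta(C)$ its size. $P_\theta(C)$, $\theta \in \Omega_0$ is called the size function and is denoted by $\beta(C, \theta)$. Similarly $P_\theta(C)$ for a given $\theta \in \Omega_A$ is called the power of the critical region $C$ at $\theta$, and $\beta(C, \theta)$ for $\theta \in \Omega_A$ the power function of $C$. Indeed we can state the problem of determining the Best Critical Region in terms of the function $\beta(C, \theta)$ for $\theta \in \Omega_0 \cup \Omega_A$: among all critical regions $C$ of level $\alpha$, i.e. such that

$$\beta(C, \theta) \le \alpha, \quad \forall\, \theta \in \Omega_0,$$

determine $C^*$ such that the power of $C^*$ is maximum at each point $\theta \in \Omega_A$.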


We can slightly generalize the idea of a critical region by defining a test function $\varphi(x)$ such that (1) $0 \le \varphi(x) \le 1$ and (2) $\varphi(x)$ is the conditional probability of rejecting $H_0$ when $x$ is the data. Note that if $\varphi(x) = 1$ if $x \in C$ and zero otherwise, then $\varphi(x)$ corresponds to a test defined by the critical region $C$. The test function allows us an extra freedom, namely rejecting $H_0$ with a probability $\varphi(x)$ when $x$ is observed, rather than either rejecting $H_0$ or accepting $H_0$ with no option in between. Now $E_\theta(\varphi(x)) = \beta_\varphi(\theta)$ is called the power function of the test defined by $\varphi$. $\beta_\varphi(\theta_1)$ for $\theta_1 \in \Omega_A$ is the power of the test at $\theta_1$ and $\beta_\varphi(\theta_0)$ for $\theta_0 \in \Omega_0$ is the size of the test at $\theta_0 \in \Omega_0$. The test $\varphi$ is called a level $\alpha$ test if

$$\sup_{\theta \in \Omega_0} \beta_\varphi(\theta) \le \alpha \iff \beta_\varphi(\theta) \le \alpha,\ \forall\, \theta \in \Omega_0. \tag{8.2.3}$$

Let $D_\alpha$ denote the class of all level $\alpha$ tests. Then $\varphi^*$ is a Uniformly Most Powerful (UMP) level $\alpha$ test if $\varphi^* \in D_\alpha$ and

$$\beta_{\varphi^*}(\theta) \ge \beta_\varphi(\theta), \quad \forall\, \varphi \in D_\alpha,\ \forall\, \theta \in \Omega_A. \tag{8.2.4}$$

Our object is to determine the UMP test and in the later sections we will present a constructive method of obtaining UMP tests when such tests exist. We will start with the problem where $\Omega_0 = \{\theta_0\}$ and $\Omega_A = \{\theta_1\}$, i.e. discriminating between two completely specified distributions with the pdf of $X$ given by $L(x, \theta_0)$ under $H_0$ or $L(x, \theta_1)$ under $H_A$. This problem is known as testing a simple null hypothesis against a simple alternative. This is the simplest problem and as we will show later it always has a solution. Since $\Omega_A$ consists of a single point only, $\varphi^*$, the best test for such a problem, is called the Most Powerful (MP) test rather than UMP, since for uniformity we need several distinct values $\theta \in \Omega_A$. We note that it is also common to denote the alternative hypothesis by $\theta \in \Omega_1$ and we will use $H_A$ and $H_1$, and $\Omega_A$ and $\Omega_1$, interchangeably.


Exercise 8.2. (i) Let $n$ chickens be selected at random and let $(X_1, \ldots, X_n)$ denote their initial weights. Let the animals be provided a special diet D for a period of 4 weeks and let $(Y_1, \ldots, Y_n)$ be their weights at the end of the treatment period. Denote by $Z_i = Y_i - X_i$ the gain in weight due to special diet D. Suppose that $(Z_1, \ldots, Z_n)$ can be treated as i.i.d. $N(\theta, 2)$ and suppose we can recommend special diet D if the gain is at least 2. For $n = 18$, $\bar{Z} \sim N(\theta, 1/9)$; consider the critical regions $C_1 = \{z \mid \bar{z} > 2 + 1/3\}$, $C_2 = \{z \mid \bar{z} < 2 + 2/3\}$ for $H_1 : \theta \ge 2$ and $H_2 : \theta < 2$. Calculate error probabilities of $C_1$ and $C_2$ for $\theta$ = 1, 1.5, 2, 2.5, 3. Discuss how you will select $H_0$ and $H_A$.

(ii) Suppose that the life of an electronic item is exponential with mean $\theta$ and let $P[X > t_0] = e^{-t_0/\theta}$. For issuing warranties up to time $t_0$, we want to test whether $e^{-t_0/\theta} \ge 1 - \delta$ where $\delta$ is sufficiently small. Then $H_1 : e^{-t_0/\theta} \ge 1 - \delta$ and $H_2 : e^{-t_0/\theta} < 1 - \delta$. Transferring these hypotheses in terms of $\theta$ by taking natural logs we have $H_1 : \theta \ge t_0/[-\log(1 - \delta)] = \theta_0$ and $H_2 : \theta < t_0/[-\log(1 - \delta)]$. Let $t_0 = 30$ and $\delta = .1$. Propose a suitable critical region based on $T = \sum_{i=1}^{20} X_i \sim G(20, \theta)$ and obtain its error probabilities at $\theta = \frac{1}{2}\theta_0, \frac{3}{4}\theta_0, \theta_0, 1\frac{1}{2}\theta_0, 2\theta_0$. Discuss how you will choose $H_0$ and $H_A$.





8.3 Neyman-Pearson Lemma and the MP Test 

Let $(X_1, \ldots, X_n)$ be a random sample of size $n$ on $X$ and let $H_0$ : $X$ has pdf $f_0(x)$ and $H_1$ : $X$ has pdf $f_1(x)$. Note that the joint pdf of the sample under $H_0$ is $L_0(x) = \prod_{i=1}^{n} f_0(x_i)$ for $x \in S_0$ and under $H_1$ it would be $L_1(x) = \prod_{i=1}^{n} f_1(x_i)$ for $x \in S_1$, where $S_0$ and $S_1$ are the supports of $L_0(x)$ and $L_1(x)$ respectively. Let $\varphi$ be a test function; then

$$\text{size of } \varphi = E[\varphi \mid H_0] = \int_{S_0} \varphi(x) L_0(x)\, dx$$
$$\text{power of } \varphi = E[\varphi \mid H_1] = \int_{S_1} \varphi(x) L_1(x)\, dx. \tag{8.3.1}$$

Let $D_\alpha = \{\varphi \mid E[\varphi \mid H_0] \le \alpha\}$ denote the class of all level $\alpha$ tests. Then to obtain the MP test of level $\alpha$ we want to maximize $E[\varphi \mid H_1]$ by varying $\varphi$ over $D_\alpha$. The famous Neyman-Pearson Lemma shows constructively that the MP test $\varphi^*$ can be determined for any $\alpha \in [0, 1]$ and $\varphi^*$ is essentially unique.

Now consider $x \in S_0^c \cap S_1^c$, i.e. where $x$ does not belong to the support of $L_0(x)$ and $L_1(x)$. Such points do not contribute to the size or the power and therefore can be excluded from consideration. Let $x \in S_0^c \cap S_1$, i.e. points for which $L_0(x) = 0$ and $L_1(x) > 0$; then we must define $\varphi^*(x) = 1$ as this increases the power to the highest possible extent without increasing the size of the test. Similarly if $x \in S_0 \cap S_1^c$, i.e. points for which $L_0(x) > 0$ but $L_1(x) = 0$, we must define $\varphi^*(x) = 0$ as this does not affect the power and at the same time decreases the size to the maximum possible extent. The points $x \in S_0 \cap S_1$, i.e. for which $L_0(x) > 0$ and $L_1(x) > 0$, need a careful consideration. The fundamental lemma of Neyman and Pearson shows us a way for defining $\varphi^*(x)$.

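THEOREM 8.3.1 [N-P Lemma (Part A)]

(i) Any test $\varphi_k(x)$ of the form

$$\varphi_k(x) = 1 \quad \text{if } L_1(x) > k L_0(x)$$
$$= \gamma(x) \quad \text{if } L_1(x) = k L_0(x) \tag{8.3.2}$$
$$= 0 \quad \text{if } L_1(x) < k L_0(x)$$

for some $k \ge 0$ and $0 \le \gamma(x) \le 1$ is a MP test of its size $E[\varphi_k(x) \mid H_0]$ for testing $H_0$ : $X$ has pdf $L_0(x)$ against $H_1$ : $X$ has pdf $L_1(x)$.

(ii) The test $\varphi_\infty(x)$ corresponding to $k = \infty$ is given by

$$\varphi_\infty(x) = 0 \quad \text{if } L_0(x) > 0$$
$$= 1 \quad \text{if } L_0(x) = 0$$

and is a MP test of level zero.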


Tests of Hypothesis-1 183 

proof: Note that the size of (p k = E[<p k I H Q ] in this case is same as the level 
of fr qf be any test with level E[<p k I H 0 ] i.e. E{ qf I H 0 ] < E[q> k I H 0 ], 

then we show that E[(p k I //j] > £[ 9 ' I to claim the result (i). Define 

Q( x ) = [<P/c(x) - <p (x)] [Lj(x) - IcLq(x)]. 

For x e E h where E { = {x I L^x) > fc L 0 (x)}, 
since 0 < q>\x) ^ L we have 0 (x) > 0 . 

For x 6 E 2 = {x I Li(x) = k L 0 (x)}, e(x) = 0. 

For x e £3 = (* I Li(x) < /: L 0 (x)}, as (p k (x) = 0 and 0 < qf(x) < 1 

we have £*(*) > 0 . Therefore J Q(x) dx> 0 which implies that 

E[(p k I - E[(ff I HJ > *{£[<& I tf 0 ] - £[<p' | tf 0 ]} (8.3.4) 

As <p' is level E[(p k I Hq] test, RHS of (8.3.4) is non-negative. Therefore 
E[(p' I H{\ < E[(p k I H\] and (p k is MP test of level E[(p k I H 0 ]. 

To prove (ii) observe that $E[\varphi_\infty \mid H_0] = 0$ and any level zero test $\varphi'$ must have its size also zero, or $E[\varphi' \mid H_0] = 0$. But

\[
E[\varphi' \mid H_0] = \int_{S_0} \varphi'(x) L_0(x)\,dx
\]

and $0 \le \varphi' \le 1$ and $L_0(x) > 0$ on $S_0$. Therefore we must have $\varphi'(x) = 0$ on $S_0$. Now

\[
E[\varphi_\infty \mid H_1] - E[\varphi' \mid H_1] = \int_{S_1} (\varphi_\infty - \varphi')\,L_1(x)\,dx
\tag{8.3.5}
\]

Now $\varphi_\infty - \varphi' = 0$ on $S_0$. Further $S_1 = (S_1 \cap S_0^c) \cup (S_1 \cap S_0)$, therefore the RHS of (8.3.5) equals

\[
\int_{S_1 \cap S_0^c} (1 - \varphi')\,L_1(x)\,dx \ge 0
\]

as $0 \le \varphi' \le 1$ and on $S_1 \cap S_0^c$, $L_1(x) > 0$. This completes the proof of (ii) of Part A of the N-P lemma.

We next show that for any $\alpha \in [0, 1]$ there exists a test of the form (8.3.2) with $\gamma(x) = \gamma$, a constant, such that $E[\varphi_k(x) \mid H_0] = \alpha$, i.e. for given $\alpha$ we can determine a corresponding $k$ and $\gamma$ so that the test $\varphi_k(x)$, by Theorem 8.3.1, is a MP test of level $\alpha$.


THEOREM 8.3.2 [N-P lemma (Part B)]

For every $\alpha$, $0 \le \alpha \le 1$, there exists a test $\varphi_k(x)$ of the form (8.3.2) with $\gamma(x) = \gamma$ such that $E[\varphi_k(x) \mid H_0] = \alpha$.





Proof: Let $\alpha = 0$; then we can take $k = \infty$ and $\varphi_\infty(x)$ is a MP test of level zero as seen earlier. We next show that for $0 < \alpha \le 1$ we can select constants $k$ and $\gamma$ such that there exists a MP test of the form (8.3.2) with $\gamma(x) = \gamma$ of size $\alpha$.

\[
E[\varphi_k(x) \mid H_0] = \int_{S_0} \varphi_k(x) L_0(x)\,dx = \int_{A} \varphi_k(x) L_0(x)\,dx + \gamma \int_{B} L_0(x)\,dx
\]

where $A = S_0 \cap E_1$ and $B = S_0 \cap E_2$. Note that on $S_0 \cap E_3$ we have $\varphi_k(x) = 0$. On $A$ and $B$, $L_0(x) > 0$, and if we define a r.v. $Y = L_1(X)/L_0(X)$ on $(S_0 \cap A) \cup (S_0 \cap B)$ then the above integrals are $P[Y > k \mid H_0]$ and $\gamma P[Y = k \mid H_0]$ respectively, and thus

\[
E[\varphi_k(x) \mid H_0] = 1 - P[Y \le k \mid H_0] + \gamma P[Y = k \mid H_0].
\]

We must thus determine $k$ and $\gamma$ such that

\[
P[Y \le k \mid H_0] - \gamma P[Y = k \mid H_0] = 1 - \alpha.
\]

If there exists a $k_0$ such that $P[Y \le k_0 \mid H_0] = 1 - \alpha$ then we take $\gamma = 0$ and $k = k_0$. If such a $k_0$ does not exist then there exists a $k_1$ such that

\[
P[Y < k_1 \mid H_0] \le 1 - \alpha < P[Y \le k_1 \mid H_0].
\]

Then we take $k = k_1$ and

\[
\gamma = \frac{P[Y \le k_1 \mid H_0] - (1 - \alpha)}{P[Y = k_1 \mid H_0]}.
\]

For this pair $(k, \gamma)$ we have $E[\varphi_k(x) \mid H_0] = \alpha$.
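The construction in this proof can be sketched in a few lines of Python for the special case where $Y = L_1(X)/L_0(X)$ takes finitely many values under $H_0$. The function name `np_constants` and the dictionary input format are assumptions made for this illustration; they are not notation from the text.

```python
from fractions import Fraction

def np_constants(dist_h0, alpha):
    """Return (k, gamma) with 1 - P[Y <= k] + gamma * P[Y = k] = alpha.

    dist_h0 maps each possible value of Y = L1(X)/L0(X) to its probability
    under H0 (finite discrete case only; the input format is hypothetical).
    alpha must lie in (0, 1].
    """
    cdf = Fraction(0)
    for k in sorted(dist_h0):
        p = Fraction(dist_h0[k])
        below = cdf              # P[Y < k | H0]
        cdf += p                 # P[Y <= k | H0]
        if p > 0 and below <= 1 - alpha <= cdf:
            # gamma = (P[Y <= k] - (1 - alpha)) / P[Y = k], as in the proof
            return k, (cdf - (1 - alpha)) / p
    raise ValueError("alpha must lie in (0, 1] and probabilities must sum to 1")

# Y takes the values 2/3 and 4/3 with probability 1/2 each under H0.
k, gamma = np_constants(
    {Fraction(2, 3): Fraction(1, 2), Fraction(4, 3): Fraction(1, 2)},
    Fraction(1, 20),
)
print(k, gamma)
```

Exact rational arithmetic is used so that the boundary cases (where $1 - \alpha$ coincides with a value of the distribution function) are not blurred by rounding.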

Thus we obtain a test of the form (8.3.2) which, by Part A of the N-P lemma, is a MP test of its size, which is its level also. Parts (A) and (B) of the N-P lemma show that for any $\alpha \in [0, 1]$ there exists a MP level $\alpha$ test, which is given by (8.3.2). Later, in Part C, we show that any other MP level $\alpha$ test can differ from the one given by (8.3.2) on the set $E_2 = \{x \mid L_1(x) = k L_0(x)\}$ only, and therefore the MP test is essentially unique. But before this we consider a few examples.

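Example 8.3.1. Let $(X_1, \ldots, X_n)$ be a random sample of size $n$ from the exponential distribution with mean $\theta$. We want to test $H_0 : \theta = \theta_0$ vs $H_1 : \theta = \theta_1$ where $\theta_1 < \theta_0$. Now

\[
\frac{L_1(x)}{L_0(x)} = \left(\frac{\theta_0}{\theta_1}\right)^{n} \exp\left\{-T\left(\frac{1}{\theta_1} - \frac{1}{\theta_0}\right)\right\}, \quad \text{where } T = \sum x_i.
\]

Now $L_1(x)/L_0(x) \gtreqless k$ according as $T \lesseqgtr C$, since $\left(\frac{1}{\theta_1} - \frac{1}{\theta_0}\right)$ is a positive constant. Further $P_{\theta_0}[L_1(X)/L_0(X) = k] = 0$ for any $k$, as this probability is equal to $P_{\theta_0}[T = C] = 0$. Thus we define $k$, or equivalently $C$, such that $P_{\theta_0}[T < C] = \alpha$, or

\[
\int_0^{C} \frac{1}{\Gamma(n)\,\theta_0^{n}}\, e^{-t/\theta_0}\, t^{n-1}\,dt = \alpha.
\]

Thus $C = \theta_0 \gamma_{n,\alpha}$ where $\gamma_{n,\alpha}$ is the $100\alpha\%$ point of the gamma distribution with shape parameter $n$.

If the alternative hypothesis is $H_1 : \theta = \theta_1$ where $\theta_1 > \theta_0$, then we would reject $H_0$ if $T > C$, i.e. for large values of $T$, where $C = \theta_0 \gamma_{n,1-\alpha}$. We observe that the MP test depends on the observations $(x_1, \ldots, x_n)$ only through the minimal sufficient statistic $T = \sum X_i$. Further, as $T$ is a continuous r.v., there is no need for randomization on the set

\[
E_2 = \{x \mid L_1(x) = k L_0(x)\} = \{x \mid T(x) = C\},
\]

as this set has probability zero under $\theta = \theta_0$. Note that the constant $C = \theta_0 \gamma_{n,\alpha}$ or $C = \theta_0 \gamma_{n,1-\alpha}$ depends only on $\theta_0$ and on whether the alternative is $\theta_1 < \theta_0$ or $\theta_1 > \theta_0$, i.e. on the relative position of $\theta_1$ w.r.t. $\theta_0$ and not on the exact value of $\theta_1$.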
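As a rough numerical companion to this example, the critical value $C$ can be computed with the standard library alone, using the closed form of the gamma distribution function for integer shape $n$ and a bisection search. The helper names (`gamma_cdf`, `gamma_quantile`, `critical_value`) and the `alternative` argument are assumptions of this sketch, not part of the text.

```python
import math

def gamma_cdf(t, n):
    """CDF of a gamma(shape=n, scale=1) variable for integer n >= 1."""
    if t <= 0:
        return 0.0
    term, total = 1.0, 1.0            # running t^j / j!, starting at j = 0
    for j in range(1, n):
        term *= t / j
        total += term
    return 1.0 - math.exp(-t) * total

def gamma_quantile(p, n):
    """Solve gamma_cdf(q, n) = p for 0 < p < 1 by bisection."""
    lo, hi = 0.0, 1.0
    while gamma_cdf(hi, n) < p:       # expand until the root is bracketed
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, n) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def critical_value(theta0, n, alpha, alternative="less"):
    """C = theta0 * (gamma percentage point); T / theta0 is gamma(n, 1) under H0.

    alternative="less" means theta1 < theta0 (reject for T < C);
    any other value is treated as theta1 > theta0 (reject for T > C).
    """
    p = alpha if alternative == "less" else 1.0 - alpha
    return theta0 * gamma_quantile(p, n)

c_low = critical_value(1.0, 5, 0.05, "less")
c_high = critical_value(1.0, 5, 0.05, "greater")
print(c_low, c_high)
```

Neither critical value uses $\theta_1$, which mirrors the remark that only the direction of the alternative matters.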

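Example 8.3.2. We consider now an example in which $X$ is discrete, so that $\lambda(x) = L_1(x)/L_0(x)$ is also discrete and we need a randomization on the boundary $E_2 = \{x \mid \lambda(x) = k\}$. Consider the case when we have a sample of size one and the pmfs under $H_0$ and $H_A$ are given by

\[
H_0 : P[X = x] = \frac{1}{2^{x+1}}, \quad x = 0, 1, 2, \ldots
\]

and

\[
H_A : P[X = x] = \frac{1}{4}\left(\frac{3}{4}\right)^{x}, \quad x = 0, 1, 2, \ldots
\]

Thus under $H_0$ we have the geometric distribution with $\theta = 1/2$ and under $H_A$ it is geometric with $\theta = 1/4$. Now

\[
\lambda(x) = \frac{L_1(x)}{L_0(x)} = \frac{3^{x}}{4^{x+1}} \Big/ \frac{1}{2^{x+1}} = \frac{1}{2}\left(\frac{3}{2}\right)^{x} > k,
\]

which is equivalent to $x > c$. The MP level $\alpha = .05$ (say) test is given by selecting $(c, \gamma)$ such that

\[
P[X > c \mid H_0] + \gamma P[X = c \mid H_0] = \alpha = .05.
\]

Now under $H_0$: $P[X > r] = \sum_{x=r+1}^{\infty} \frac{1}{2^{x+1}} = \frac{1}{2^{r+1}}$. Thus if there exists an integer $r_0$ such that $\frac{1}{2^{r_0+1}} = .05$ then we select $c = r_0$ and $\gamma = 0$. However, for $\alpha = .05 = \frac{1}{20}$ we have $\frac{1}{32} < \frac{1}{20} < \frac{1}{16}$, and $P[X > 3] = \frac{1}{16} > \frac{1}{20}$ but $P[X > 4] = \frac{1}{32} < \frac{1}{20}$. Thus we select $r_0 = 4$ and find $\gamma$ such that $P[X > 4] + \gamma P[X = 4] = .05$, i.e. $\frac{1}{32} + \frac{\gamma}{32} = \frac{1}{20}$ or $\gamma = \frac{3}{5}$. Therefore the MP test of level $\alpha = .05$ is given by

\[
\varphi(x) =
\begin{cases}
1 & \text{if } x = 5, 6, 7, \ldots \\
\frac{3}{5} & \text{if } x = 4 \\
0 & \text{if } x = 0, 1, 2, 3.
\end{cases}
\]

The power of this MP test is

\[
\left(\frac{3}{4}\right)^{5} + \frac{3}{5}\left(\frac{3}{4}\right)^{4}\frac{1}{4}.
\]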
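The arithmetic of this example can be checked with exact fractions. The following is a minimal sketch; the function names are chosen for this illustration only.

```python
from fractions import Fraction

def p0_tail(c):
    """P[X > c | H0] for the pmf 1 / 2^(x+1), x = 0, 1, 2, ..."""
    return Fraction(1, 2 ** (c + 1))

def p0_point(c):
    """P[X = c | H0]."""
    return Fraction(1, 2 ** (c + 1))

def mp_constants(alpha):
    """Smallest c with P[X > c | H0] <= alpha, and the matching gamma."""
    c = 0
    while p0_tail(c) > alpha:
        c += 1
    gamma = (alpha - p0_tail(c)) / p0_point(c)
    return c, gamma

def power(c, gamma):
    """P[X > c | HA] + gamma * P[X = c | HA] for the pmf (1/4)(3/4)^x."""
    q = Fraction(3, 4)
    return q ** (c + 1) + gamma * Fraction(1, 4) * q ** c

c, gamma = mp_constants(Fraction(1, 20))
print(c, gamma, power(c, gamma), float(power(c, gamma)))
```

The search stops at the first $c$ whose tail probability does not exceed $\alpha$, so the resulting $\gamma$ always lies in $[0, 1)$.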

Example 8.3.3. Let $X$ be Cauchy with location $\theta$ and known scale $\sigma = 1$, and $H_0 : \theta = 0$ and $H_1 : \theta = 1$. We show that the test $\varphi(x) = 1$ if $1 < x < 3$ and zero otherwise is a MP test of its size. Note that

\[
\lambda(x) = \frac{L_1(x)}{L_0(x)} = \frac{1 + x^2}{2 + x^2 - 2x}.
\]

From the N-P lemma (Part A), if we show that the given test $\varphi(x)$ corresponds to a value $k$ then the result is proved. Since $\lambda(x)$ is a continuous function, the region $\lambda(x) > k$ is defined by $x^2(1 - k) + 2kx + (1 - 2k) > 0$. The set $E_2 = \{x \mid \lambda(x) = k\}$ consists of the two roots of the quadratic equation $x^2(1 - k) + 2kx + (1 - 2k) = 0$ and $P(E_2) = 0$. Thus we need to show that $E_1 = \{x \mid 1 < x < 3\}$ corresponds to a value of $k$ such that $\{x \mid x^2(1 - k) + 2kx + (1 - 2k) > 0\} = E_1$. Now another way to describe $E_1$ is $E_1 = \{x \mid (3 - x)(x - 1) > 0\}$ or $E_1 = \{x \mid -x^2 + 4x - 3 > 0\}$. Comparing the two quadratic expressions, if we take $k = 2$ then $\varphi_2(x)$ corresponds to the critical region $(1, 3)$. The level of this test is the same as its size, given by

\[
P[1 < X < 3 \mid H_0] = \frac{1}{\pi}\int_1^3 \frac{dx}{1 + x^2} = \frac{1}{\pi}\left[\tan^{-1}(3) - \tan^{-1}(1)\right].
\]

The power of this test is

\[
\frac{1}{\pi}\int_1^3 \frac{dx}{1 + (x - 1)^2} = \frac{1}{\pi}\int_0^2 \frac{du}{1 + u^2} = \frac{1}{\pi}\tan^{-1}(2).
\]
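A quick numerical check of this example, assuming only the standard Cauchy distribution function $F(x) = \frac{1}{2} + \frac{1}{\pi}\tan^{-1}(x - \theta)$; the function names below are illustrative.

```python
import math

def cauchy_cdf(x, loc=0.0):
    """Distribution function of the Cauchy law with location loc, scale 1."""
    return 0.5 + math.atan(x - loc) / math.pi

def lam(x):
    """Likelihood ratio L1(x) / L0(x) for theta = 1 against theta = 0."""
    return (1.0 + x * x) / (1.0 + (x - 1.0) ** 2)

size = cauchy_cdf(3.0, 0.0) - cauchy_cdf(1.0, 0.0)    # P[1 < X < 3 | H0]
power = cauchy_cdf(3.0, 1.0) - cauchy_cdf(1.0, 1.0)   # P[1 < X < 3 | H1]

# The ratio equals k = 2 at the end points and exceeds 2 strictly inside (1, 3).
print(lam(1.0), lam(3.0), lam(2.0), lam(0.0), lam(4.0))
print(size, power)
```

The printed ratios illustrate that the interval $(1, 3)$ is exactly the region where $\lambda(x) > 2$.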

In the context of a testing problem, the maximum tolerable type I error $\alpha$ is prescribed and we determine the MP level $\alpha$ test by determining the corresponding $(k, \gamma)$, which we will denote by $(k_\alpha, \gamma_\alpha)$. As seen from Theorem 8.3.2, the determination of $(k_\alpha, \gamma_\alpha)$ depends on the distribution of $Y = L_1(X)/L_0(X)$ under $H_0$. We now give below a few examples to show that the distribution of $Y$ under $H_0$ can be complicated and consequently the determination of $(k_\alpha, \gamma_\alpha)$ may not be so easy.

Example 8.3.4. Consider a random sample of size one, and suppose we want to obtain the MP test for $H_0 : X \sim N(0, 1)$ vs $H_1 : X \sim C(0, 1)$. Then $L_1(x) = \frac{1}{\pi}\,\frac{1}{1 + x^2}$, $x \in R_1$, and $L_0(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$, $x \in R_1$. Note that here $S_0 = S_1 = R_1$ and

\[
\lambda(x) = \sqrt{\frac{2}{\pi}}\;\frac{\exp(x^2/2)}{1 + x^2}.
\]

As $\lambda(x)$ depends on $x$ only through $x^2$, we have $\lambda(x) = \lambda(-x)$, $\forall\, x \in R_1$, and if $x \in E_1 = \{x \mid \lambda(x) > k\}$ then $-x \in E_1$ and $E_1$ is symmetric about the origin. Again $P(\lambda(X) = k \mid H_0) = 0$, since for any fixed $k$, $\exp(x^2/2) = k\sqrt{\pi/2}\,(1 + x^2)$ holds for only a finite number of points and cannot hold for $x$ in an interval $(a, b)$. Thus the MP level $\alpha$ test is given by

\[
\varphi_k(x) = 1 \quad \text{if } \frac{\exp\{x^2/2\}}{1 + x^2} > c,
\]

and zero otherwise, where $c$ is determined by $E[\varphi_k(x) \mid H_0] = \alpha$.

To obtain the critical region corresponding to $\varphi_k(x)$ we study the function $\exp\{x^2/2\}/(1 + x^2)$ in detail. We can restrict attention to $x \ge 0$ and consider the equivalent variable $y = x^2$. Let $u(y) = \exp(y/2)/(1 + y)$, $y \ge 0$. Then

\[
u'(y) = -\frac{\exp(y/2)}{(1 + y)^2} + \frac{1}{2}\,\frac{\exp(y/2)}{1 + y} = \frac{\exp(y/2)}{2(1 + y)^2}\,(y - 1),
\]

so $u'(y) = 0$ at $y = 1$, $u'(y) < 0$ for $0 < y < 1$ and $u'(y) > 0$ for $y > 1$. Hence $u(y)$ for $y \ge 0$ attains its minimum value at $y = 1$, has a local maximum at $y = 0$ and goes to $\infty$ as $y \to \infty$. Returning to $x$, we have $\lambda(0) = .7979$, $\lambda(1) = .6577$, and $\lambda(x)$ is decreasing in $(0, 1)$ and increasing in $(1, \infty)$. The behaviour of $\lambda(x)$ for $x < 0$ is exactly the mirror image, having minimum $\lambda(-1) = .6577$ and $\lambda(x) \to \infty$ as $x \to -\infty$.

The horizontal line $\lambda(x) = k$ intersects the graph of $\lambda(x)$

(i) in two points if $k > .7979$
(ii) in three points if $k = .7979$
(iii) in four points if $k \in (.6577, .7979)$
(iv) in two points if $k = .6577$
(v) in no point if $k < .6577$.

Thus the MP test $\varphi_k(x)$ has size given by

\[
E[\varphi_k(x) \mid H_0] = 1 \quad \text{if } 0 \le k < .6577.
\]

If $k \in (.6577, .7979)$ then

\[
E[\varphi_k(x) \mid H_0] = P[-c_1 < X < c_1 \mid H_0] + P[X < -c_2 \mid H_0] + P[X > c_2 \mid H_0],
\]

where $0 < c_1 < 1 < c_2$ are the positive roots of $\lambda(x) = k$, and

\[
E[\varphi_k(x) \mid H_0] = P[X < -d_1 \mid H_0] + P[X > d_1 \mid H_0] \quad \text{if } k > .7979,
\]

where $d_1$ is the positive root of $\lambda(x) = k$. For the standard level $\alpha = .05$ the MP test will have critical region $|x| > d_1$ provided $\lambda(d_1) > .7979$. Now from normal tables for $\alpha = .05$, $d_1 = 1.96$ with $\lambda(d_1) = 1.1250 > .7979$. The power of the MP test is $1 - \frac{2}{\pi}\tan^{-1}(1.96)$.
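The values of $\lambda(x)$ quoted in this example, and the power of the level .05 test, can be recomputed directly. This is only a sketch; `lam` is a name chosen here for $\lambda(x)$, and `statistics.NormalDist` from the standard library supplies the normal percentage point.

```python
import math
from statistics import NormalDist

def lam(x):
    """lambda(x) = L1(x) / L0(x) for Cauchy C(0, 1) against N(0, 1)."""
    return math.sqrt(2.0 / math.pi) * math.exp(x * x / 2.0) / (1.0 + x * x)

d1 = NormalDist().inv_cdf(0.975)               # two-sided 5% point of N(0, 1)
power = 1.0 - 2.0 * math.atan(d1) / math.pi    # P[|X| > d1] under C(0, 1)

print(round(lam(0.0), 4), round(lam(1.0), 4), round(lam(d1), 4))
print(round(d1, 4), round(power, 4))
```

Since $\lambda(d_1)$ exceeds $\lambda(0)$, the line $\lambda(x) = k$ at that height meets the graph in two points only, so the critical region is indeed of the form $|x| > d_1$.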
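Example 8.3.5. We now give an example in which $X$ is continuous under $H_0$ and $H_1$ but $\lambda(x)$ is discrete. Let $X$ be $U(0, 1)$ under $H_0$ and let its pdf under $H_1$ be given by

\[
f_1(x) =
\begin{cases}
\frac{4}{3}, & 0 < x < \frac{1}{4} \text{ or } \frac{3}{4} < x < 1 \\
\frac{2}{3}, & \frac{1}{4} \le x \le \frac{3}{4}.
\end{cases}
\]

Here

\[
\lambda(x) =
\begin{cases}
\frac{2}{3}, & \frac{1}{4} \le x \le \frac{3}{4} \\
\frac{4}{3}, & 0 < x < \frac{1}{4} \text{ or } \frac{3}{4} < x < 1.
\end{cases}
\]

Then $Y = \lambda(X)$ has d.f. under $H_0$ given by

\[
G_0(y) =
\begin{cases}
0 & \text{if } y < \frac{2}{3} \\
\frac{1}{2} & \text{if } \frac{2}{3} \le y < \frac{4}{3} \\
1 & \text{if } y \ge \frac{4}{3}.
\end{cases}
\]

[Figure: the pdfs of $X$ under $H_0$ and $H_1$.]

The MP level $\alpha$ test given by the N-P lemma can be obtained by determining $(k_\alpha, \gamma_\alpha)$ such that

\[
1 - P[Y \le k_\alpha] + \gamma P[Y = k_\alpha] = \alpha
\]

or $1 - G_0(k_\alpha) + \gamma P[Y = k_\alpha] = \alpha$.

For example, if $\alpha = .05$ we must have $(k_\alpha, \gamma_\alpha)$ such that under $H_0$: $P[Y < k_\alpha] \le .95 \le P[Y \le k_\alpha]$. Using $G_0(y)$ defined above we must have $k_\alpha = 4/3$ and therefore $\gamma \cdot \frac{1}{2} = .05$ or $\gamma = .10$. Thus, the MP test rejects $H_0$ when $x \in \left(0, \frac{1}{4}\right) \cup \left(\frac{3}{4}, 1\right)$ with probability $.10$ and otherwise accepts $H_0$. The power of this test is $.10\left(\frac{2}{3}\right) = .0667$.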

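Example 8.3.6. In all the examples considered above we had $S_0 = S_1$. We now consider a situation in which $S_0 \ne S_1$. Let $(X_1, \ldots, X_n)$ be i.i.d. $U(0, \theta)$ and let $H_0 : \theta = \theta_0$ and $H_1 : \theta = \theta_1 > \theta_0$. Here

\[
L(x, \theta_0) = \frac{1}{\theta_0^{n}} \prod_{i} I_{(0, \theta_0)}(x_i) = L_0(x) \quad \text{and} \quad
L(x, \theta_1) = \frac{1}{\theta_1^{n}} \prod_{i} I_{(0, \theta_1)}(x_i) = L_1(x).
\]

Note that $S_0 \cap S_1 = (0, \theta_0)$ as $\theta_1 > \theta_0$, and $S_0^c \cap S_1 = (\theta_0, \theta_1)$. Thus if any of the observed $x_i$'s exceed $\theta_0$ then the MP test must reject $H_0$. This is equivalent to defining $\varphi(x) = 1$ if $x_{(n)} > \theta_0$. Recall that $T = x_{(n)}$ is minimal sufficient for $U(0, \theta)$ and as such $L(x, \theta) = g(t, \theta)\,h(x)$, where

\[
g(t, \theta) = \frac{n t^{n-1}}{\theta^{n}}, \quad 0 < t < \theta.
\]

Hence

\[
\frac{L_1(x)}{L_0(x)} = \frac{g(t, \theta_1)}{g(t, \theta_0)}.
\]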


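The MP test thus can be defined in terms of $t$ only, and we consider $\lambda(t) = g(t, \theta_1)/g(t, \theta_0)$ over $0 < t < \theta_0$. Then $\lambda(t) = \left(\frac{\theta_0}{\theta_1}\right)^{n}$ for $0 < t < \theta_0$ and is constant. Here $\lambda(t)$ takes only one value, $\left(\frac{\theta_0}{\theta_1}\right)^{n}$, under $H_0$ with probability one, and under $H_A$ over $(0, \theta_0)$ it takes the same value with probability $\left(\frac{\theta_0}{\theta_1}\right)^{n}$. Note that under $H_A$ over $[\theta_0, \theta_1)$, $\lambda(t)$ is not defined but could be taken as $+\infty$ (an improper value), which is consistent with our rule that the MP test rejects $H_0$, i.e. $\varphi(t) = 1$, if $t > \theta_0$. To determine the MP level $\alpha$ test we must determine $(k_\alpha, \gamma_\alpha)$ such that

\[
1 - P[\lambda(T) \le k_\alpha \mid H_0] + \gamma P[\lambda(T) = k_\alpha \mid H_0] = \alpha.
\]

The d.f. of $\lambda(T)$ under $H_0$ is given by

\[
G_0(y) =
\begin{cases}
0 & \text{if } y < \left(\frac{\theta_0}{\theta_1}\right)^{n} \\
1 & \text{if } y \ge \left(\frac{\theta_0}{\theta_1}\right)^{n}.
\end{cases}
\]

We must take $k_\alpha = \left(\frac{\theta_0}{\theta_1}\right)^{n}$ and then $\gamma = \alpha$. Thus the MP test is

\[
\varphi^*(t) =
\begin{cases}
1 & \text{if } t > \theta_0 \\
\alpha & \text{if } 0 < t < \theta_0.
\end{cases}
\]

The power of $\varphi^*(t)$ is

\[
\alpha \int_0^{\theta_0} \frac{n t^{n-1}}{\theta_1^{n}}\,dt + \int_{\theta_0}^{\theta_1} \frac{n t^{n-1}}{\theta_1^{n}}\,dt
= 1 - \left(\frac{\theta_0}{\theta_1}\right)^{n}(1 - \alpha).
\]
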

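Now consider a test $\varphi_1$ defined as:

\[
\varphi_1(t) =
\begin{cases}
1 & \text{if } t > \theta_0 \\
\gamma(t) & \text{if } 0 < t < \theta_0
\end{cases}
\]

where $0 \le \gamma(t) \le 1$ is such that

\[
\int_0^{\theta_0} \gamma(t)\,\frac{n t^{n-1}}{\theta_0^{n}}\,dt = \alpha.
\tag{B}
\]

The power of $\varphi_1(t)$ is

\[
\int_0^{\theta_0} \gamma(t)\,\frac{n t^{n-1}}{\theta_1^{n}}\,dt + \int_{\theta_0}^{\theta_1} \frac{n t^{n-1}}{\theta_1^{n}}\,dt
= \left(\frac{\theta_0}{\theta_1}\right)^{n} \int_0^{\theta_0} \gamma(t)\,\frac{n t^{n-1}}{\theta_0^{n}}\,dt + 1 - \left(\frac{\theta_0}{\theta_1}\right)^{n}
= 1 - \left(\frac{\theta_0}{\theta_1}\right)^{n}(1 - \alpha)
\]

in view of (B).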


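This example shows that there exist several level $\alpha$ tests given by (B) whose power is the same as that of the MP test. For example, consider a specific choice of $\gamma(t)$ given by

\[
\gamma(t) =
\begin{cases}
0 & \text{if } 0 < t < a \\
1 & \text{if } a \le t < \theta_0.
\end{cases}
\]

Then $a$ is given by the condition that $\int_a^{\theta_0} \frac{n t^{n-1}}{\theta_0^{n}}\,dt = \alpha$, or $1 - \left(\frac{a}{\theta_0}\right)^{n} = \alpha$, or $a = \theta_0 (1 - \alpha)^{1/n}$, and

\[
\varphi_a(t) =
\begin{cases}
1 & \text{if } t > a = \theta_0 (1 - \alpha)^{1/n} \\
0 & \text{otherwise}
\end{cases}
\]

is a level $\alpha$ test and has the same power as the MP test given by the N-P lemma. Similarly, if we take

\[
\gamma(t) = 1 \quad \text{if } 0 < t < b = \theta_0\,\alpha^{1/n}
\]

and zero otherwise, this will lead to a level $\alpha$ test having the same power as the MP test given by the N-P lemma.

This example shows that in general the MP level $\alpha$ test is not unique.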
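The equality of size and power for the three choices of $\gamma(t)$ discussed in this example can be verified numerically from the distribution function of $T = x_{(n)}$, $P[T \le t \mid \theta] = (t/\theta)^n$ for $0 < t < \theta$. The function and label names below are assumptions of this sketch.

```python
def cdf_max(t, theta, n):
    """P[T <= t] for T = max of n i.i.d. U(0, theta) observations."""
    if t <= 0:
        return 0.0
    return min(1.0, t / theta) ** n

def reject_prob(test, theta, theta0, n, alpha):
    """Probability of rejecting H0: theta = theta0 when the true value is theta >= theta0."""
    tail = 1.0 - cdf_max(theta0, theta, n)           # P[T > theta0]
    if test == "constant":   # gamma(t) = alpha on (0, theta0)
        return alpha * cdf_max(theta0, theta, n) + tail
    if test == "upper":      # reject when t > a = theta0 * (1 - alpha)^(1/n)
        a = theta0 * (1.0 - alpha) ** (1.0 / n)
        return 1.0 - cdf_max(a, theta, n)
    if test == "lower":      # reject when t < b = theta0 * alpha^(1/n) or t > theta0
        b = theta0 * alpha ** (1.0 / n)
        return cdf_max(b, theta, n) + tail
    raise ValueError(test)

n, theta0, theta1, alpha = 5, 1.0, 1.5, 0.05
for name in ("constant", "upper", "lower"):
    print(name,
          reject_prob(name, theta0, theta0, n, alpha),   # size
          reject_prob(name, theta1, theta0, n, alpha))   # power at theta1
```

All three tests share the size $\alpha$ and the power $1 - (\theta_0/\theta_1)^n (1 - \alpha)$, although they reject on different subsets of $(0, \theta_0)$.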

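Example 8.3.7. Here we consider the same distributional set up as in the last example but consider $H_0 : \theta = \theta_0$ and $H_A : \theta = \theta_1 < \theta_0$. Here $S_0 = (0, \theta_0)$ and $S_1 = (0, \theta_1) \subset S_0$. Hence for any $x_i \in S_0 \cap S_1^c$, i.e. $x_i \in (\theta_1, \theta_0)$, we accept $H_0$, or $\varphi(x) = 0$ if any $x_i > \theta_1$, which is equivalent to $t = \max(x_i) = x_{(n)} > \theta_1$. Thus $\varphi(t) = 0$ if $t \in (\theta_1, \theta_0)$. Now in this case

\[
\lambda(t) =
\begin{cases}
\left(\frac{\theta_0}{\theta_1}\right)^{n} & \text{for } 0 < t < \theta_1 \\
0 & \text{for } \theta_1 < t < \theta_0.
\end{cases}
\]

We leave it as an exercise to the reader (using the technique outlined in the above example) to show that any test $\varphi(t)$ such that

\[
\varphi(t) =
\begin{cases}
\gamma(t) & 0 < t < \theta_1 \\
0 & \theta_1 \le t < \theta_0
\end{cases}
\]

is a MP level $\alpha$ test, where $0 \le \gamma(t) \le 1$ is chosen such that $\int_0^{\theta_1} \gamma(t)\,\frac{n t^{n-1}}{\theta_0^{n}}\,dt = \alpha$.

The power of this test is $\alpha \left(\frac{\theta_0}{\theta_1}\right)^{n}$, and one possible choice of $\gamma(t)$ is

\[
\gamma(t) =
\begin{cases}
1 & \text{if } 0 < t < \theta_0\,\alpha^{1/n} \\
0 & \text{otherwise.}
\end{cases}
\]

Note that if $\theta_1 < \theta_0\,\alpha^{1/n}$, the power of this test is one.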

The above examples show that MP test given by N-P lemma parts (A) and 
(B) is not unique. However, we show in the next theorem that it is essentially 
unique. 

THEOREM 8.3.3 [N-P lemma (Part C)] 

For every $\alpha \in (0, 1]$, the MP test given by N-P lemma part (B) is essentially unique, in that if $\varphi_{k_\alpha}(x)$ and $\varphi'(x)$ are both MP level $\alpha$ tests given by the N-P lemma then $\varphi_{k_\alpha}(x) \ne \varphi'(x)$ only on the set $E_2 = \{x \mid L_1(x) = k_\alpha L_0(x)\}$. Note that $E[\varphi_{k_\alpha}(x) \mid H_0] = E[\varphi'(x) \mid H_0]$ and $E[\varphi_{k_\alpha}(x) \mid H_1] = E[\varphi'(x) \mid H_1]$, and therefore if

\[
Q(x) = [\varphi_{k_\alpha}(x) - \varphi'(x)]\,[L_1(x) - k_\alpha L_0(x)]
\]

then $\int Q(x)\,dx = 0$. However, as seen before, $Q(x) \ge 0$ and therefore $Q(x) = 0$. Thus $\varphi_{k_\alpha}(x) = \varphi'(x)$ on $E_1$ and $E_3$. However, since on $E_2$, $L_1(x) = k_\alpha L_0(x)$, we can define $\varphi_{k_\alpha}(x) \ne \varphi'(x)$ for which $Q(x) = 0$, and MP level $\alpha$ tests can differ only on $E_2$.


Remark 8.3.1: Using the N-P lemma and the Neyman factorizability criterion we can show that, if under $H_0$ the joint pdf is $L(x, \theta_0)$ and under $H_1$ it is $L(x, \theta_1)$, then $L(x, \theta_1) \gtreqless k L(x, \theta_0)$ according as $g(t, \theta_1) \gtreqless k\, g(t, \theta_0)$, where $T$ is the minimal sufficient statistic for $\{L(x, \theta), \theta \in \Omega\}$. Therefore the regions $E_1$, $E_2$, $E_3$ can be defined in terms of $t$, and if $\varphi$ is a MP test then the corresponding test defined in terms of the minimal sufficient statistic is a MP test of the same level (size). Another way to look at this situation is: if $\varphi(x)$ is any test and we define $\psi(t) = E[\varphi(X) \mid T = t]$, then as $0 \le \varphi(x) \le 1$ we have $0 \le \psi(t) \le 1$. Further, as $T$ is minimal sufficient, $E[\varphi(X) \mid T = t]$ does not depend on $\theta$ and equals $\psi(t)$, which is a test function depending on the observations only through the minimal sufficient statistic $T$, and $E[\psi(T) \mid \theta] = E[\varphi(X) \mid \theta]$. Thus for any test function $\varphi(x)$ there is an equivalent test function $\psi(t)$ which depends on the observations only through the minimal sufficient statistic $T$.


In the next section we show that for the one parameter exponential family the MP test given by the N-P lemma also provides a solution to more complex problems such as testing $H_0 : \theta = \theta_0$ against $H_1 : \theta > \theta_0$, and $H_0 : \theta \le \theta_0$ against $H_1 : \theta > \theta_0$.

Exercise 8.3. (i) Show that for a MP level $\alpha$ test, power $\ge$ level (Hint: take $\varphi_\alpha(x) \equiv \alpha$, $\forall\, x \in S_0 \cup S_1$).

(ii) Let $H_0 : X \sim U(0, 1)$ and $H_1 : X \sim f_1(x)$ where

$f_1(x) = 4x$, $0 < x < 1/2$
$\phantom{f_1(x)} = 4 - 4x$, $1/2 \le x < 1$.

Obtain the MP level $\alpha$ test and its power.

(iii) Let $(X_1, \ldots, X_n)$ be i.i.d. exponential with location $\theta$, i.e. $f(x, \theta) = \exp\{-(x - \theta)\}$, $x > \theta$. Obtain the MP level $\alpha$ test to test $H_0 : \theta = \theta_0$ against $H_1 : \theta = \theta_1$. Discuss the two cases $\theta_1 > \theta_0$ and $\theta_1 < \theta_0$ separately.

(iv) Let $(X_1, \ldots, X_n)$ be i.i.d. Pareto with shape parameter $\lambda$, i.e. $f(x, \lambda) = \lambda / x^{\lambda + 1}$, $x > 1$, $\lambda > 0$. Obtain the MP level $\alpha$ test for testing $H_0 : \lambda = \lambda_0$ against $H_1 : \lambda = \lambda_1$ and discuss the two cases $\lambda_1 > \lambda_0$ and $\lambda_1 < \lambda_0$.

(v) … obtain the MP level $\alpha$ test for testing $H_0 : X \sim N(0, 1)$ against … . Obtain the power of this test and verify that power > level.

(vi) Let $X \sim N(0, 1)$ under $H_0$ and $N(0, 2)$ under $H_1$. Suppose $\varphi(x) = 1$ if $|x| > 1$ and zero otherwise. Is $\varphi(x)$ a MP test of level equal to its size?

(vii) Let $X$ be a discrete r.v. with pmf under $H_0$ and $H_1$ given by

$p_0(x) = .05$ for $x = 1, 2, \ldots, 20$
$p_1(x) = .60$ for $x = 1$, $p_1(x) = .15$ for $x = 2, 3$

and

$p_1(x) = \frac{1}{170}$ for $x = 4, 5, \ldots, 20$.

Let $\varphi_1(x) = 1$ if $x = 1, 2$ and zero otherwise, and $\varphi_2(x) = 1$ if $x = 1$, $\varphi_2(x) = 1/2$ if $x = 2, 3$ and zero otherwise. Show that both $\varphi_1$, $\varphi_2$ are MP tests of level $\alpha = .10$ and power $.75$. Do $\varphi_1$, $\varphi_2$ satisfy the NP-lemma?

(viii) Let $X$ be a non-negative r.v. with pdf under $H_0$ given by the Weibull density $f_0(x) = $ …, $x > 0$, and under $H_1$ given by $f_1(x) = \sqrt{2/\pi}\; e^{-x^2/2}$, $x > 0$ (folded normal). Based on a single observation obtain the MP level $\alpha$ test and its power.

(ix) Let $f(x, \theta) = \theta(2x) + (1 - \theta)\,2(1 - x)$, $0 < x < 1$, $0 \le \theta \le 1$. On the basis of a single observation obtain the MP level $\alpha = .05$ test to test (a) $H_0 : \theta = 1/4$, $H_1 : \theta = 3/4$, (b) $H_0 : \theta = 3/4$, $H_1 : \theta = 1/4$, (c) $H_0 : \theta = 1$, $H_1 : \theta = 1/2$.

8.4 Uniformly Most Powerful (UMP) Tests

Suppose we want to test $H_0 : \theta = \theta_0$ against $H_A : \theta > \theta_0$. Here the null hypothesis is simple and the alternative is composite. Let $D_\alpha = \{\varphi \mid E[\varphi(x) \mid \theta_0] \le \alpha\}$ be the class of all level $\alpha$ tests. A test $\varphi^* \in D_\alpha$ is called UMP if

\[
E[\varphi^* \mid \theta] \ge E[\varphi \mid \theta], \quad \forall\, \theta > \theta_0, \ \forall\, \varphi \in D_\alpha
\tag{8.4.1}
\]

Let $\beta_\varphi(\theta)$ denote $E[\varphi \mid \theta]$, which we call the power function of the test $\varphi$. Note that $\beta_\varphi(\theta_0)$ is the type I error, $\beta_\varphi(\theta_1)$ for $\theta_1 > \theta_0$ is the power at $\theta_1$, and $1 - \beta_\varphi(\theta_1)$ is the type II error at $\theta_1$. We now show, by way of an example, how the N-P lemma can be used to obtain a UMP test.

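Example 8.4.1. Let $(X_1, \ldots, X_n)$ be a random sample of size $n$ from $\{N(\theta, 1), \theta \in R_1\}$. Suppose we want to test $H_0 : \theta = \theta_0$ against $H_A : \theta > \theta_0$. Consider a sub-problem of testing $H_0 : \theta = \theta_0$ vs $H'_A : \theta = \theta_1 > \theta_0$, a specific alternative. Then using the NP lemma, the MP level $\alpha$ test for the subproblem is given by

\[
\varphi_1(x) =
\begin{cases}
1 & \text{if } \bar{x} > \theta_0 + \dfrac{\xi_{1-\alpha}}{\sqrt{n}} \\
0 & \text{otherwise}
\end{cases}
\]

and its power function is

\[
\beta_{\varphi_1}(\theta) = 1 - \Phi\left[\sqrt{n}\,(\theta_0 - \theta) + \xi_{1-\alpha}\right]
\]

where $\xi_{1-\alpha}$ is the $100(1 - \alpha)\%$ point of $N(0, 1)$. Observe that the MP test $\varphi_1$ does not depend on the particular alternative $\theta_1$ chosen for the sub-problem. Therefore $\varphi_1$ is the MP level $\alpha$ test for $H_0 : \theta = \theta_0$ vs $H'_A : \theta = \theta_1$ for each $\theta_1 > \theta_0$, and therefore we conclude that $\varphi_1$ is the UMP level $\alpha$ test for the original problem $H_0 : \theta = \theta_0$ vs $H_A : \theta > \theta_0$.

We also note that $\beta_{\varphi_1}(\theta)$ is a monotone increasing function of $\theta$, as $\Phi\left[\sqrt{n}\,(\theta_0 - \theta) + \xi_{1-\alpha}\right]$ is a decreasing function of $\theta$. Further $\lim_{\theta \to \infty} \beta_{\varphi_1}(\theta) = 1$ and $\beta_{\varphi_1}(\theta_0) = \alpha$.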
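The power function of this one-sided test is easy to tabulate. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name `power_ump` and the parameter values are chosen for illustration only.

```python
from statistics import NormalDist

def power_ump(theta, theta0, n, alpha):
    """Power of the test rejecting when the sample mean exceeds
    theta0 + xi_{1-alpha} / sqrt(n), for a N(theta, 1) sample of size n."""
    z = NormalDist()
    xi = z.inv_cdf(1.0 - alpha)           # 100(1 - alpha)% point of N(0, 1)
    return 1.0 - z.cdf(n ** 0.5 * (theta0 - theta) + xi)

theta0, n, alpha = 0.0, 25, 0.05
for theta in (0.0, 0.1, 0.2, 0.5, 1.0):
    print(theta, round(power_ump(theta, theta0, n, alpha), 4))
```

The tabulated values start at $\alpha$ for $\theta = \theta_0$ and increase towards one, in line with the monotonicity noted in the example.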

Example 8.4.2. Consider Example 8.3.6 of testing $H_0 : \theta = \theta_0$ against $H'_A : \theta = \theta_1 > \theta_0$ in $\{U(0, \theta), \theta > 0\}$. Here we showed that the MP level $\alpha$ test is given by

\[
\varphi_1(t) =
\begin{cases}
1 & \text{if } t > \theta_0 \\
\gamma(t) & \text{if } 0 < t < \theta_0
\end{cases}
\]

where $\gamma(t)$ satisfies the condition $E[\gamma(T) \mid \theta_0] = \alpha$. We have also seen that the above MP level $\alpha$ test is not unique.

Since $\varphi_1$ does not depend on the particular value of $\theta_1$ but only on its relative position w.r.t. $\theta_0$, namely $\theta_1 > \theta_0$, we have that $\varphi_1$ is a UMP level $\alpha$ test for testing $H_0 : \theta = \theta_0$ against $H_A : \theta > \theta_0$. This example also shows that the UMP level $\alpha$ test is not necessarily unique.

We now show that the results of Example 8.4.1 for the $N(\theta, 1)$ hold in general for the one parameter exponential family.

Let $(X_1, \ldots, X_n)$ be a random sample of size $n$ from a pdf belonging to the one parameter exponential family, so that

\[
\log L(x, \theta) = u(\theta) \sum K(x_i) + n\,v(\theta) + \sum w(x_i).
\]

Suppose we are interested in testing $H_0 : \theta = \theta_0$ against $H_A : \theta > \theta_0$; then consider the subproblem $H_0 : \theta = \theta_0$ vs $H'_A : \theta = \theta_1 > \theta_0$. Note that here $S_0 = S_1 = S$ and therefore

\[
\log \lambda(x) = \log \frac{L_1(x)}{L_0(x)} = \sum K(x_i)\,[u(\theta_1) - u(\theta_0)] + n\,[v(\theta_1) - v(\theta_0)]
\]

and by the N-P lemma the MP level $\alpha$ test for $H_0$ against $H'_A$, assuming that $u(\theta_1) > u(\theta_0)$, is given by

\[
\varphi_1(x) =
\begin{cases}
1 & \text{if } \sum K(x_i) = T > k_\alpha \\
\gamma_\alpha & \text{if } \sum K(x_i) = T = k_\alpha \\
0 & \text{if } \sum K(x_i) = T < k_\alpha.
\end{cases}
\tag{8.4.2}
\]

If $u(\theta_1) < u(\theta_0)$ then the MP level $\alpha$ test is given by

\[
\varphi_2(x) =
\begin{cases}
1 & \text{if } \sum K(x_i) = T < k'_\alpha \\
\gamma'_\alpha & \text{if } \sum K(x_i) = T = k'_\alpha \\
0 & \text{if } \sum K(x_i) = T > k'_\alpha.
\end{cases}
\tag{8.4.3}
\]

Here $T = \sum K(x_i)$ is the minimal sufficient statistic for $\{L(x, \theta), \theta \in \Omega\}$. Further, for the test given by (8.4.2), $k_\alpha$ and $\gamma_\alpha$ are determined by the condition

\[
1 - P_{\theta_0}[T \le k_\alpha] + \gamma_\alpha P_{\theta_0}[T = k_\alpha] = \alpha
\tag{8.4.4}
\]

If $u(\theta_1) < u(\theta_0)$ then the MP test is given by (8.4.3) and $k'_\alpha$ and $\gamma'_\alpha$ are determined by

\[
P_{\theta_0}[T \le k'_\alpha] - P_{\theta_0}[T = k'_\alpha] + \gamma'_\alpha P_{\theta_0}[T = k'_\alpha] = \alpha
\tag{8.4.5}
\]

In both cases the MP test depends on the distribution of $T$ under $H_0$ and on whether $u(\theta_1) > u(\theta_0)$ or $u(\theta_1) < u(\theta_0)$, and not on the specific choice of $\theta_1$.

Now as $\frac{du}{d\theta} \ne 0$, $u(\theta)$ is increasing or decreasing. In case $u(\theta)$ is increasing, $u(\theta_1) > u(\theta_0)$ and the MP level $\alpha$ test for $H_0$ against $H'_A$ is given by $\varphi_1$, i.e. (8.4.2). Since $\varphi_1$ does not depend on the specific value of $\theta_1$, as for any $\theta_1 \in H_A$, $u(\theta_1) - u(\theta_0) > 0$, $\varphi_1$ is the UMP level $\alpha$ test for testing $H_0$ against $H_A$. On the other hand, if $u(\theta)$ is decreasing then $u(\theta_1) < u(\theta_0)$ for every $\theta_1 \in H_A$ and $\varphi_2$ given by (8.4.3) is the UMP level $\alpha$ test for testing $H_0$ against $H_A$.

Similarly, we can show that for testing $H_0 : \theta = \theta_0$ against $H_A : \theta < \theta_0$, if $u(\theta)$ is increasing, i.e. $u(\theta_1) < u(\theta_0)$, $\varphi_2$ is the UMP level $\alpha$ test, whereas if $u(\theta)$ is decreasing, i.e. $u(\theta_1) > u(\theta_0)$, $\varphi_1$ will be the UMP level $\alpha$ test for the problem.

We next show that the power functions of these UMP tests are monotone. We consider the case where $u(\theta)$ is increasing and we are testing $H_0 : \theta = \theta_0$ against $H_A : \theta > \theta_0$. As seen earlier in this section, the UMP test in this case is given by $\varphi_1$ as defined in (8.4.2). The power function of this test is given by

\[
\beta_{\varphi_1}(\theta) = 1 - P[T \le k_\alpha \mid \theta] + \gamma_\alpha P[T = k_\alpha \mid \theta]
\tag{8.4.6}
\]

We will show that $\beta_{\varphi_1}(\theta_2) \ge \beta_{\varphi_1}(\theta_1)$ if $\theta_2 > \theta_1$, so that $\beta_{\varphi_1}(\theta)$ is increasing. Consider testing $H'_0 : \theta = \theta_1$ against $H'_A : \theta = \theta_2$ where $\theta_1 < \theta_2$. Then, as $u(\theta_2) > u(\theta_1)$, the MP level $\alpha_1 = \beta_{\varphi_1}(\theta_1)$ test is given by

\[
\varphi'_1(x) =
\begin{cases}
1 & \text{if } T > c_\alpha \\
\delta_\alpha & \text{if } T = c_\alpha \\
0 & \text{if } T < c_\alpha
\end{cases}
\]

where $c_\alpha$, $\delta_\alpha$ are determined by

\[
1 - P[T \le c_\alpha \mid \theta_1] + \delta_\alpha P[T = c_\alpha \mid \theta_1] = \beta_{\varphi_1}(\theta_1).
\]

This implies that $k_\alpha = c_\alpha$ and $\gamma_\alpha = \delta_\alpha$, and $\varphi'_1$ is the same test as $\varphi_1$, or $\varphi'_1 = \varphi_1$.

But $\beta_{\varphi'_1}(\theta_2) \ge \beta_{\varphi'_1}(\theta_1)$, as the power of a MP test is at least as large as its level, as seen in Exercise 8.3(i). But as $\varphi'_1 = \varphi_1$ we have $\beta_{\varphi_1}(\theta_1) \le \beta_{\varphi_1}(\theta_2)$ and the power function $\beta_{\varphi_1}(\theta)$ is increasing. Similarly, if $u(\theta)$ is decreasing

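then one can show that the UMP test $\varphi_2$ for this problem has power function increasing in $\theta$. The remaining cases of $H_0 : \theta = \theta_0$ vs $H_A : \theta < \theta_0$ and $u(\theta)$ increasing (decreasing) can be handled in a similar way. For example, consider $H_0 : \theta = \theta_0$ vs $H_A : \theta < \theta_0$ and $u(\theta)$ increasing. Then the UMP level $\alpha$ test is given by $\varphi_2$ as given in (8.4.3) with constants $(k'_\alpha, \gamma'_\alpha)$ given by (8.4.5). Now consider testing $H'_0 : \theta = \theta_1$ against $H'_A : \theta = \theta_2 < \theta_1$ at level $\beta_{\varphi_2}(\theta_1)$. Then the MP test of level $\beta_{\varphi_2}(\theta_1)$ is given by

\[
\varphi'_2(x) =
\begin{cases}
1 & \text{if } T < c_\alpha \\
\delta_\alpha & \text{if } T = c_\alpha \\
0 & \text{if } T > c_\alpha.
\end{cases}
\]

Here $c_\alpha$, $\delta_\alpha$ are given by

\[
\beta_{\varphi_2}(\theta_1) = P[T \le c_\alpha \mid \theta_1] - P[T = c_\alpha \mid \theta_1] + \delta_\alpha P[T = c_\alpha \mid \theta_1],
\]

which is the power of the test $\varphi_2$ at $\theta_1$. This implies that $c_\alpha = k'_\alpha$ and $\delta_\alpha = \gamma'_\alpha$, so that $\varphi'_2 = \varphi_2$. However $\beta_{\varphi'_2}(\theta_2) \ge \beta_{\varphi'_2}(\theta_1)$, and thus $\beta_{\varphi_2}(\theta_2) \ge \beta_{\varphi_2}(\theta_1)$ for $\theta_2 < \theta_1$, i.e. $\beta_{\varphi_2}(\theta)$ is decreasing in $\theta$.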





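Example 8.4.3. Let $(X_1, \ldots, X_n)$ be i.i.d. Pareto with pdf $f(x, \lambda) = \dfrac{\lambda}{x^{\lambda+1}}$, $x > 1$, $\lambda > 0$. As seen before, $\{f(x, \lambda), \lambda > 0\}$ is a one parameter exponential family with $u(\lambda) = -(\lambda + 1)$ and $T = \sum \log x_i$ as minimal sufficient statistic, with

\[
g_n(t, \lambda) = \frac{\lambda^{n}}{\Gamma(n)}\, e^{-\lambda t}\, t^{n-1}, \quad t > 0, \ \lambda > 0.
\]

Now $u'(\lambda) = -1$, so $u(\lambda)$ is decreasing, and as $T$ is a continuous r.v. the UMP level $\alpha$ test for testing $H_0 : \lambda = 1$ vs $H_A : \lambda < 1$ is given by $\varphi(x) = 1$ if $T > t_\alpha$ and zero otherwise, where $t_\alpha$ is given by $\int_{t_\alpha}^{\infty} g_n(t, 1)\,dt = \alpha$ or $1 - G_n(t_\alpha) = \alpha$, where $G_n(u)$ is the d.f. of a $G(n, 1)$ r.v. Thus $t_\alpha = \eta_{n, 1-\alpha}$, the $100(1 - \alpha)\%$ point of $G(n, 1)$. The power function of this test is given by

\[
\int_{t_\alpha}^{\infty} g_n(t, \lambda)\,dt = 1 - G_n(\lambda\,\eta_{n, 1-\alpha}) = \beta_\varphi(\lambda).
\]

Now

\[
\frac{d\beta_\varphi(\lambda)}{d\lambda} = -g_n(\lambda\,\eta_{n, 1-\alpha}, 1)\;\eta_{n, 1-\alpha} < 0 \quad \text{as } \eta_{n, 1-\alpha} > 0.
\]

Therefore $\beta_\varphi(\lambda)$ is decreasing. On the other hand, if $H_0 : \lambda = 1$ and $H_A : \lambda > 1$ then the UMP level $\alpha$ test is given by $\varphi'(x) = 1$ if $T < k'_\alpha$ and zero otherwise, where $G_n(k'_\alpha) = \alpha$ or $k'_\alpha = \eta_{n, \alpha}$, i.e. the test is left tailed. The power function of this test is given by $\beta_{\varphi'}(\lambda) = G_n(\lambda\,\eta_{n, \alpha})$ and is increasing in $\lambda$.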

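Example 8.4.4. Consider $(X_1, \ldots, X_n)$ as i.i.d. $b(1, \theta)$. Then $\{L(x, \theta), \theta \in (0, 1)\}$ is a one parameter exponential family with $u(\theta) = [\log \theta - \log(1 - \theta)]$ and $T = \sum X_i$ as minimal sufficient statistic. Here $u'(\theta) = \dfrac{1}{\theta(1 - \theta)} > 0$ and therefore $u(\theta)$ is increasing. Suppose we want to test $H_0 : \theta = \theta_0$ against $H_A : \theta > \theta_0$; then the UMP level $\alpha$ test is given by

\[
\varphi(x) =
\begin{cases}
1 & \text{if } T > r_0 \\
\gamma_\alpha & \text{if } T = r_0 \\
0 & \text{if } T < r_0
\end{cases}
\]

where $r_0$, $\gamma_\alpha$ are given by

\[
\sum_{r = r_0 + 1}^{n} \binom{n}{r} \theta_0^{r} (1 - \theta_0)^{n - r} + \gamma_\alpha \binom{n}{r_0} \theta_0^{r_0} (1 - \theta_0)^{n - r_0} = \alpha.
\]

The power function of this test is given by

\[
\beta_\varphi(\theta) = \sum_{r = r_0 + 1}^{n} \binom{n}{r} \theta^{r} (1 - \theta)^{n - r} + \gamma_\alpha \binom{n}{r_0} \theta^{r_0} (1 - \theta)^{n - r_0} \quad \text{for } \theta > \theta_0.
\]

The reader can verify that $\beta_\varphi(\theta)$ is increasing in $\theta$.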
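For a concrete feel of the randomized binomial test, the constants $(r_0, \gamma_\alpha)$ and the power function can be computed as follows. This is a sketch with illustrative names and parameter values ($n = 10$, $\theta_0 = 0.5$, $\alpha = .05$ are assumptions, not values from the text).

```python
from math import comb

def binom_pmf(r, n, theta):
    """P[T = r] for T ~ binomial(n, theta)."""
    return comb(n, r) * theta ** r * (1.0 - theta) ** (n - r)

def ump_constants(n, theta0, alpha):
    """Largest r0 with P[T >= r0 | theta0] > alpha, and the matching gamma."""
    tail = 0.0                                   # P[T > r0 | theta0]
    for r0 in range(n, -1, -1):
        p = binom_pmf(r0, n, theta0)
        if tail + p > alpha:
            return r0, (alpha - tail) / p
        tail += p
    raise ValueError("alpha must be smaller than 1")

def power(theta, n, r0, gamma):
    """P[T > r0 | theta] + gamma * P[T = r0 | theta]."""
    upper = sum(binom_pmf(r, n, theta) for r in range(r0 + 1, n + 1))
    return upper + gamma * binom_pmf(r0, n, theta)

n, theta0, alpha = 10, 0.5, 0.05
r0, gamma = ump_constants(n, theta0, alpha)
print(r0, gamma)
for theta in (0.5, 0.6, 0.7, 0.8):
    print(theta, round(power(theta, n, r0, gamma), 4))
```

The printed powers give a numerical illustration of the monotonicity of $\beta_\varphi(\theta)$ that the reader is asked to verify.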



Exercise 8.4. (i) (Inverse binomial sampling). In a production process, rather than examining a fixed number of items and finding out the number of items which are defective, we use the inverse procedure. Here we continue testing items until $m_0$ defective items are found, where $m_0$ is a pre-assigned number. The total number of items thus examined would be $m_0 + x$, where $x$ is the number of non-defective items in the sample. In usual binomial sampling the number of items examined is fixed and the number of defective items is random. For inverse binomial sampling the number of defective items is fixed and the number of items examined is random. The pmf is

\[
P[X = x] = \binom{m_0 + x - 1}{x}\,\theta^{m_0} (1 - \theta)^{x}, \quad x = 0, 1, 2, \ldots, \ 0 < \theta < 1.
\]

Show that the distribution belongs to the one parameter exponential family with $u(\theta)$ decreasing and where $x$ is minimal sufficient. For testing $H_0 : \theta = .1$ against $H_A : \theta > .1$ find the UMP level $\alpha = .05$ test and its power function when $m_0 = 2$. Also do the similar exercise for $m_0 = 1$ and $m_0 = 3$ and compare the power functions of all the three UMP tests.

(ii) Using the solution in Exercise 8.3 (iii) obtain UMP level $\alpha$ tests for (a) $H_0 : \theta = \theta_0$, $H_A : \theta < \theta_0$ and (b) $H_0 : \theta = \theta_0$, $H_A : \theta > \theta_0$. Verify that their power functions are monotone.

(iii) Let $(X_1, \ldots, X_n)$ be i.i.d. with pdf $f(x, \theta) = \theta / x^2$ for $0 < \theta < x < \infty$. Using results on the Pitman family (Ch. 2, Sec. 6), we use $t = x_{(1)}$, the minimal sufficient statistic with pdf

\[
g(t, \theta) = n\,\theta^{n} / t^{n+1}, \quad \theta < t < \infty.
\]

Obtain the UMP level $\alpha$ test for testing $H_0 : \theta = \theta_0$ against $H_A : \theta = \theta_1 < \theta_0$. After having obtained the results from first principles, verify that using the transformation $y = \frac{1}{x}$ we have $y \sim U\left(0, \frac{1}{\theta}\right)$, and results from Examples 8.3.6 and 8.4.2 can be used directly.

(iv) Let $(X_1, \ldots, X_n)$ be i.i.d. $G(\lambda, 1)$ with pdf $f(x, \lambda) = \dfrac{1}{\Gamma(\lambda)}\, e^{-x} x^{\lambda - 1}$, $x > 0$, $\lambda > 0$. Obtain the UMP level $\alpha$ test for testing $H_0 : \lambda = 1$ against $H_A : \lambda > 1$.

(v) Let $(X_1, \ldots, X_n)$ be i.i.d. with pdf $f(x, \theta) = \theta x^{\theta - 1}$, $0 < x < 1$, $\theta > 0$. Obtain the UMP level $\alpha$ test for testing $H_0 : \theta = 1$ against $H_A : \theta < 1$.

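We conclude this section with the observation that for the one parameter exponential family the UMP test derived for the simple null hypothesis $\theta = \theta_0$ against one sided alternatives $\theta > \theta_0$ (or $\theta < \theta_0$) continues to be the UMP test for testing one sided composite null hypotheses $\theta \le \theta_0$ (or $\theta \ge \theta_0$) against one sided alternatives $\theta > \theta_0$ (or $\theta < \theta_0$). This is a consequence of the monotone nature of the power function $\beta_\varphi(\theta)$. We illustrate the case for testing $H_0 : \theta \le \theta_0$ against $H_A : \theta > \theta_0$ where $u(\theta)$ is increasing, and leave the other three cases as an exercise to the reader.

Let $D_\alpha$ denote the class of all level $\alpha$ tests for $H_0 : \theta \le \theta_0$, i.e.

\[
D_\alpha = \{\varphi \mid \beta_\varphi(\theta) \le \alpha, \ \forall\, \theta \le \theta_0\}
\tag{8.4.7}
\]

Consider the boundary point $\theta = \theta_0$ of $H_0$ and any $\theta_1 \in H_A$. Then by the NP lemma, the MP level $\alpha$ test for the subproblem $H'_0 : \theta = \theta_0$ vs $H'_A : \theta = \theta_1$ is given by $\varphi_1$ as defined in (8.4.2) with constants $k_\alpha$, $\gamma_\alpha$ given by (8.4.4). The MP level $\alpha$ test $\varphi_1$ does not depend on the specific choice of $\theta_1 \in H_A$ and therefore is the UMP test for $H'_0 : \theta = \theta_0$ against $H_A : \theta > \theta_0$. Since every test in $D_\alpha$ is in particular of level $\alpha$ at $\theta = \theta_0$, it is enough to show that $\varphi_1 \in D_\alpha$ as defined in (8.4.7). We have already seen that the power function $\beta_{\varphi_1}(\theta)$ is increasing for $\theta > \theta_0$. We now show that $\beta_{\varphi_1}(\theta)$ is increasing for $\theta \le \theta_0$ also. Let $\theta'' < \theta' \le \theta_0$ and consider testing the simple null hypothesis $H''_0 : \theta = \theta''$ against the alternative $H''_1 : \theta = \theta'$ at level $\beta_{\varphi_1}(\theta'')$. As $u(\theta)$ is increasing, the MP level $\beta_{\varphi_1}(\theta'')$ test is given by

\[
\varphi''_1 =
\begin{cases}
1 & \text{if } T > k''_\alpha \\
\gamma''_\alpha & \text{if } T = k''_\alpha \\
0 & \text{if } T < k''_\alpha.
\end{cases}
\]

Here the constants $k''_\alpha$, $\gamma''_\alpha$ are determined by

\[
\beta_{\varphi_1}(\theta'') = 1 - P[T \le k''_\alpha \mid \theta''] + \gamma''_\alpha P[T = k''_\alpha \mid \theta'']
\]

and therefore $k''_\alpha = k_\alpha$ and $\gamma''_\alpha = \gamma_\alpha$ and $\varphi''_1 = \varphi_1$. The power of the MP test $\varphi''_1$ at $\theta'$, namely $\beta_{\varphi''_1}(\theta')$, is not less than the level $\beta_{\varphi''_1}(\theta'')$. But $\varphi''_1 = \varphi_1$ and therefore for $\theta'' < \theta' \le \theta_0$ we have $\beta_{\varphi_1}(\theta'') \le \beta_{\varphi_1}(\theta')$, and $\beta_{\varphi_1}(\theta)$ is increasing for $\theta \le \theta_0$. Therefore $\varphi_1 \in D_\alpha$ as defined in (8.4.7) and would be the UMP level $\alpha$ test for $H_0 : \theta \le \theta_0$ against $H_A : \theta > \theta_0$.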

It now follows that UMP level a tests derived in Examples 8.4.1, 8.4.3, 
8.4.4, 8.4.5 and Exercises 8.4 (iv) and (v) are UMP level a tests for one 
sided composite null hypotheses. 

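Example 8.4.5. Let $(X_1, \ldots, X_n)$ be i.i.d. $U(0, \theta)$. Suppose we want to test $H_0 : \theta \le \theta_0$ against $H_A : \theta > \theta_0$. Then consider the sub-problem $H'_0 : \theta = \theta_0$ against $H'_A : \theta = \theta_1 > \theta_0$. From Example 8.3.6, the MP level $\alpha$ test is given by

\[
\varphi_1(t) =
\begin{cases}
1 & \text{if } t = x_{(n)} > \theta_0 \\
\gamma(t) & \text{if } 0 < t < \theta_0
\end{cases}
\]

where $0 \le \gamma(t) \le 1$ is such that $\int_0^{\theta_0} \gamma(t)\,\frac{n t^{n-1}}{\theta_0^{n}}\,dt = \alpha$. As the MP test does not depend upon the specific alternative $\theta_1 \in H_A$, we have that $\varphi_1$ is a UMP level $\alpha$ test. The power function for any choice of $\gamma(t)$ is given by

\[
\beta_{\varphi_1}(\theta) = 1 - \left(\frac{\theta_0}{\theta}\right)^{n}(1 - \alpha) \quad \text{for } \theta > \theta_0.
\]

Since $\gamma(t)$ can be chosen suitably, let $\gamma(t) = \alpha$ for $0 < t < \theta_0$. Then a UMP test of level $\alpha$ is given by

\[
\varphi^*(t) =
\begin{cases}
1 & \text{if } t > \theta_0 \\
\alpha & \text{if } 0 < t < \theta_0.
\end{cases}
\]

Now for $\theta' < \theta_0$,

\[
\beta_{\varphi^*}(\theta') = \int_0^{\theta'} \alpha\,\frac{n t^{n-1}}{\theta'^{\,n}}\,dt = \alpha.
\]

Therefore $\beta_{\varphi^*}(\theta') \le \alpha$, $\forall\, \theta' \le \theta_0$, and $\varphi^*$ is a level $\alpha$ test for the composite null hypothesis $H_0 : \theta \le \theta_0$ and continues to be a UMP level $\alpha$ test for $H_0 : \theta \le \theta_0$ against $H_A : \theta > \theta_0$.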

The connecting link between the one parameter exponential family and the Pitman family is the Monotone Likelihood Ratio (MLR) property. Let $\{L(x, \theta), \theta \in \Omega \subset R_1\}$ be the family of joint pdfs of $(X_1, \ldots, X_n)$. Consider $\theta_0$, $\theta_1$ with $\theta_1 > \theta_0$ in $\Omega$; then $\{L(x, \theta), \theta \in \Omega \subset R_1\}$ is said to have the MLR property in a real valued statistic $T(x)$ if $L(x, \theta_1)/L(x, \theta_0)$ is a monotone function of $T(x)$, i.e. it is either a non-increasing or a non-decreasing function of $T(x)$. Note that if $T(x)$ is minimal sufficient for $\theta$ then we can consider $L(x, \theta_1)/L(x, \theta_0) = g(t, \theta_1)/g(t, \theta_0)$ and check whether this ratio, or the logarithm of it, is a monotone function of $T$. It is easy to check that the one parameter exponential family has the MLR property in $T = \sum K(x_i)$ and $U(0, \theta)$, $\theta > 0$, has the MLR property in $T = \max(X_1, \ldots, X_n)$. In the first case

\[
\log L(x, \theta_1) - \log L(x, \theta_0) = (u(\theta_1) - u(\theta_0)) \sum K(x_i) + n\,(v(\theta_1) - v(\theta_0))
\]

and is strictly increasing or decreasing in $T$ according as $\frac{du}{d\theta} > 0$ or $\frac{du}{d\theta} < 0$. In the $\{U(0, \theta), \theta > 0\}$ case

\[
\frac{L(x, \theta_1)}{L(x, \theta_0)} =
\begin{cases}
\left(\frac{\theta_0}{\theta_1}\right)^{n} & \text{if } 0 < t < \theta_0 \\
\infty & \text{if } \theta_0 \le t < \theta_1
\end{cases}
\]

which is a non-decreasing function of $t = x_{(n)}$.

Let $\{L(x, \theta), \theta \in \Omega\}$ be a family with the MLR property in some statistic $T$. Suppose we want to test $H_0 : \theta = \theta_0$ against $H'_1 : \theta = \theta_1 > \theta_0$, where $L(x, \theta_1)/L(x, \theta_0)$ is, say, a non-decreasing function of $T$. Then the NP-lemma gives the test $\varphi_1$ given by (8.4.2). The proof that $\varphi_1$ is UMP for $H_0 : \theta = \theta_0$ vs $H_1 : \theta > \theta_0$ and that $\beta_{\varphi_1}(\theta)$ is increasing in $\theta$ is word for word the same for this case as it is in the case of the one parameter exponential family, and therefore will not be repeated here. Similarly $\varphi_1$ continues to be the UMP level $\alpha$ test for the wider problem $H_0 : \theta \le \theta_0$ vs $H_1 : \theta > \theta_0$, and the proof is left to the reader as an exercise.


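Exercise 8.4 (contd.). (vi) Show that $f(x, \theta) = \theta/x^2$, $0 < \theta < x < \infty$, has the MLR property. Obtain the UMP test for $H_0 : \theta \ge \theta_0$ against $H_1 : \theta < \theta_0$.

(vii) Let $X$ be a sample of size one from the Double exponential distribution with pdf $f(x, \theta) = \frac{1}{2} \exp(-|x - \theta|)$, $x \in R_1$. Show that $\{f(x, \theta),\ \theta \in R_1\}$ has the MLR property. Obtain the UMP test of level $\alpha = .05$ for testing $H_0 : \theta \le 0$ against $\theta > 0$.

(viii) Let $X$ be a sample of size one from the Cauchy distribution with $f(x, \mu) = \dfrac{1}{\pi [1 + (x - \mu)^2]}$, $x \in R_1$, $\mu \in R_1$. Check whether $\{f(x, \mu),\ \mu \in R_1\}$ has the MLR property.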

In the next section we show that in general a UMP level $\alpha$ test does not exist for testing a simple null hypothesis $H_0 : \theta = \theta_0$ vs $H_A : \theta \ne \theta_0$, a two sided alternative. We also give an example in which the UMP test exists for testing a simple null hypothesis against two sided composite alternatives.

8.5 Non-existence of UMP Tests

We now consider a situation in which we want to test $H_0 : \theta = \theta_0$ against $H_A : \theta \ne \theta_0$. We first show by way of an example that for a random sample of size $n$ from, say, $N(\theta, 1)$, there does not exist a UMP level $\alpha$ test for two sided alternatives.

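Example 8.5.1. Suppose if possible there exists a UMP level $\alpha$ test $\varphi^*$ for $H_0 : \theta = \theta_0$ vs $H_A : \theta \ne \theta_0$. Consider the subproblem of testing $H_0 : \theta = \theta_0$ against $H'_A : \theta > \theta_0$. Then by Example 8.4.1, the UMP test for this subproblem is given by

$$\varphi_1(x) = \begin{cases} 1 & \text{if } \bar{x} > \theta_0 + \xi_{1-\alpha}/\sqrt{n} \\ 0 & \text{otherwise.} \end{cases}$$

Now as seen earlier, for any test $\varphi(x)$ there exists a test based on the minimal sufficient statistic $\bar{x}$ given by $E[\varphi(x) \mid \bar{x}] = \psi(\bar{x})$ such that $\beta_\varphi(\theta) = \beta_\psi(\theta)$, $\forall\ \theta \in \Omega$. Therefore wlg we can take $\varphi^*$ to be a test based on $\bar{x}$. Let $\beta_{\varphi^*}(\theta)$ be its power function. Then $\beta_{\varphi^*}(\theta) = \beta_{\varphi_1}(\theta)$, $\forall\ \theta > \theta_0$, because if this were not true then there exists a $\theta_1 > \theta_0$ such that either $\beta_{\varphi^*}(\theta_1) > \beta_{\varphi_1}(\theta_1)$ or $\beta_{\varphi^*}(\theta_1) < \beta_{\varphi_1}(\theta_1)$. If the first possibility holds, this contradicts the fact that $\varphi_1$ is UMP for $H_0 : \theta = \theta_0$ against $H'_A : \theta > \theta_0$. If the second possibility holds, this contradicts the fact that $\varphi^*$ is UMP for $H_A$, which includes $H'_A$. Therefore $\beta_{\varphi^*}(\theta) - \beta_{\varphi_1}(\theta) = 0$, $\forall\ \theta > \theta_0$. However in view of completeness of the family $\{N(\theta, 1/n),\ \theta > \theta_0\}$ we must have $\varphi^*(\bar{x}) = \varphi_1(\bar{x})$.

Next consider the problem $H_0 : \theta = \theta_0$ against $H''_A : \theta < \theta_0$. Then the UMP level $\alpha$ test is given by $\varphi_2(\bar{x}) = 1$ if $\bar{x} < \theta_0 - \xi_{1-\alpha}/\sqrt{n}$ and zero otherwise. Arguing as above we first claim that $\beta_{\varphi_2}(\theta) = \beta_{\varphi^*}(\theta)$ for $\theta < \theta_0$ and then, using completeness of the family $\{N(\theta, 1/n),\ \theta < \theta_0\}$, conclude that $\varphi^*(\bar{x}) = \varphi_2(\bar{x})$. This leads to the contradiction that $\varphi_1 = \varphi_2$, when in fact $\varphi_1 \ne \varphi_2$. Thus there does not exist any UMP test for $H_0 : \theta = \theta_0$ against $H_A : \theta \ne \theta_0$. In a course on statistical methods we come across the equal tail criterion test given by

$$\varphi_3(x) = \begin{cases} 1 & \text{if } \bar{x} \notin \left[\theta_0 - \dfrac{\xi_{1-\alpha/2}}{\sqrt{n}},\ \theta_0 + \dfrac{\xi_{1-\alpha/2}}{\sqrt{n}}\right] \\ 0 & \text{otherwise.} \end{cases}$$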


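This is a level $\alpha$ test for the two sided problem with power function given by

$$\beta_{\varphi_3}(\theta) = 1 - \Phi(\sqrt{n}(\theta_0 - \theta) + \xi_{1-\alpha/2}) + \Phi(\sqrt{n}(\theta_0 - \theta) + \xi_{\alpha/2}) \qquad (8.5.1)$$

One can verify that for $\delta > 0$

$$\beta_{\varphi_3}(\theta_0 + \delta) < \beta_{\varphi_1}(\theta_0 + \delta) \quad \text{and} \quad \beta_{\varphi_3}(\theta_0 - \delta) < \beta_{\varphi_2}(\theta_0 - \delta).$$

However $\beta_{\varphi_1}(\theta) < \beta_{\varphi_3}(\theta)$ for $\theta < \theta_0$ and $\beta_{\varphi_2}(\theta) < \beta_{\varphi_3}(\theta)$ for $\theta > \theta_0$.

Thus we can view $\varphi_3$ as a compromise test although it is not UMP for two sided alternatives. In fact $\varphi_1$ and $\varphi_2$ are biased tests for $H_0 : \theta = \theta_0$ vs $H_A : \theta \ne \theta_0$.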
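The comparison of the three power functions can be checked numerically. The sketch below is illustrative only: the function names and the values $\theta_0 = 0$, $n = 25$, $\alpha = .05$, $\delta = 0.3$ are assumptions, and the formulas are those of the $N(\theta, 1)$ case discussed above, evaluated with the standard library's `NormalDist`.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def power_upper(theta, theta0, n, alpha):
    """Power of phi_1: reject if xbar > theta0 + xi_{1-alpha}/sqrt(n)."""
    return 1.0 - Z.cdf(math.sqrt(n) * (theta0 - theta) + Z.inv_cdf(1.0 - alpha))

def power_lower(theta, theta0, n, alpha):
    """Power of phi_2: reject if xbar < theta0 + xi_{alpha}/sqrt(n)."""
    return Z.cdf(math.sqrt(n) * (theta0 - theta) + Z.inv_cdf(alpha))

def power_equal_tail(theta, theta0, n, alpha):
    """Power of phi_3, the equal tail test, as in (8.5.1)."""
    d = math.sqrt(n) * (theta0 - theta)
    return 1.0 - Z.cdf(d + Z.inv_cdf(1.0 - alpha / 2)) + Z.cdf(d + Z.inv_cdf(alpha / 2))

theta0, n, alpha, delta = 0.0, 25, 0.05, 0.3  # hypothetical settings
for theta in (theta0 - delta, theta0, theta0 + delta):
    print(theta,
          round(power_upper(theta, theta0, n, alpha), 4),
          round(power_lower(theta, theta0, n, alpha), 4),
          round(power_equal_tail(theta, theta0, n, alpha), 4))
```

On the wrong side of $\theta_0$ each one sided test has power below $\alpha$, which is the bias mentioned in the text.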

The analysis given here for the $N(\theta, 1)$ case can be extended to the one parameter exponential family in a straightforward manner, where $T = \sum K(x_i)$ is the minimal sufficient statistic and we can restrict ourselves to tests based on $T$ with pdf $\{g(t, \theta),\ \theta \in \Omega\}$, which itself is a one parameter exponential family. Again the sub-families $\{g(t, \theta),\ \theta > \theta_0\}$ and $\{g(t, \theta),\ \theta < \theta_0\}$ are complete. If we assume that there exists a UMP level $\alpha$ test for $H_0 : \theta = \theta_0$ against $H_A : \theta \ne \theta_0$ given by $\varphi^*(t)$ then we can show that $\varphi_1(t) = \varphi^*(t) = \varphi_2(t)$ where $\varphi_1(t)$ and $\varphi_2(t)$ are the UMP level $\alpha$ tests for the one sided alternatives $H'_A : \theta > \theta_0$ and $H''_A : \theta < \theta_0$ respectively. However $\varphi_1(t) \ne \varphi_2(t)$ for any $t$ and thus we have a contradiction.

We note that in the above proof of non-existence of a UMP level $\alpha$ test for two sided alternatives, completeness of the subfamilies $\{g(t, \theta),\ \theta > \theta_0\}$ and $\{g(t, \theta),\ \theta < \theta_0\}$ plays a very crucial role. If either of these two subfamilies is not complete the proof breaks down and there may exist a UMP level $\alpha$ test for two sided alternatives. We show that this is indeed the case for the $U(0, \theta)$ model where $t = \max(x_1, \ldots, x_n)$ is minimal sufficient and, as seen earlier in Chapter 3, $\{g(t, \theta),\ \theta > \theta_0\}$ is not a complete family.

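Example 8.5.2. From Examples 8.3.6 and 8.4.2, for testing $H_0 : \theta = \theta_0$ vs $H'_A : \theta > \theta_0$ the UMP level $\alpha$ test is given by

$$\varphi_1(t) = \begin{cases} 1 & \text{if } t \ge \theta_0 \\ \gamma_1(t) & \text{if } 0 < t < \theta_0 \end{cases}$$

where $\gamma_1(t)$ is chosen so that $\int_0^{\theta_0} \gamma_1(t) \dfrac{n t^{n-1}}{\theta_0^n}\, dt = \alpha$. Thus the UMP test is not unique, and we exploit this nonuniqueness to determine a UMP test for two sided alternatives. Consider $H_0 : \theta = \theta_0$ against $H''_A : \theta = \theta_1 < \theta_0$; then

$$\lambda(t) = \frac{g(t, \theta_1)}{g(t, \theta_0)} = \begin{cases} (\theta_0/\theta_1)^n & \text{if } 0 < t < \theta_1 \\ 0 & \text{if } \theta_1 \le t < \theta_0. \end{cases}$$

Here an MP level $\alpha$ test is given by

$$\varphi_2(t) = \begin{cases} \gamma_2(t) & \text{if } 0 < t < \theta_1 \\ 0 & \text{if } \theta_1 \le t < \theta_0 \end{cases}$$

where $\gamma_2(t)$ is such that $\int_0^{\theta_1} \gamma_2(t) \dfrac{n t^{n-1}}{\theta_0^n}\, dt = \alpha$. The power is

$$\beta_{\varphi_2}(\theta_1) = \int_0^{\theta_1} \gamma_2(t) \frac{n t^{n-1}}{\theta_1^n}\, dt = (\theta_0/\theta_1)^n \alpha$$

if $(\theta_0/\theta_1)^n \alpha < 1$, and otherwise it would be unity as $\beta_{\varphi_2}(\theta) \le 1$ for any $\theta$. Here the MP level $\alpha$ test appears to depend on the particular $\theta_1$. However if $\gamma_2(t)$ is appropriately chosen then we can get around this problem. Towards this end, let $\gamma_2(t) = 1$, $0 < t < a < \theta_1$, such that

$$\int_0^a \frac{n t^{n-1}}{\theta_0^n}\, dt = \alpha \quad \text{or} \quad a = \theta_0 \alpha^{1/n}.$$

The power of the resulting test $\varphi'_2$ is

$$\beta_{\varphi'_2}(\theta_1) = \int_0^a \frac{n t^{n-1}}{\theta_1^n}\, dt = (\theta_0/\theta_1)^n \alpha \quad \text{if } a = \theta_0 \alpha^{1/n} < \theta_1$$

$$= 1 \quad \text{if } \theta_1 \le \theta_0 \alpha^{1/n} = a.$$

Thus

$$\varphi'_2(t) = \begin{cases} 1 & \text{if } 0 < t < \theta_0 \alpha^{1/n} \\ 0 & \text{if } t \ge \theta_0 \alpha^{1/n} \end{cases}$$

is a UMP level $\alpha$ test for $H_0 : \theta = \theta_0$ against $H''_A : \theta < \theta_0$ since $\varphi'_2(t)$ is independent of the particular $\theta_1 \in H''_A$.

The power function of $\varphi'_2$ is given by

$$\beta_{\varphi'_2}(\theta) = \begin{cases} (\theta_0/\theta)^n \alpha & \text{if } \theta_0 \alpha^{1/n} < \theta < \theta_0 \\ 1 & \text{if } \theta \le \theta_0 \alpha^{1/n}. \end{cases}$$

We define $\varphi^*$ to be a combination of $\varphi_1$ and $\varphi'_2$ so that

$$\varphi^*(t) = \begin{cases} 1 & \text{if } 0 < t < \theta_0 \alpha^{1/n} \\ 0 & \text{if } \theta_0 \alpha^{1/n} \le t < \theta_0 \\ 1 & \text{if } t \ge \theta_0. \end{cases}$$

Then $\varphi^*$ is a UMP level $\alpha$ test and its power function is given by

$$\beta_{\varphi^*}(\theta) = \begin{cases} 1 & \text{if } \theta \le \theta_0 \alpha^{1/n} \\ (\theta_0/\theta)^n \alpha & \text{if } \theta_0 \alpha^{1/n} < \theta \le \theta_0 \\ 1 - (\theta_0/\theta)^n (1 - \alpha) & \text{if } \theta > \theta_0. \end{cases}$$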
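The power function of $\varphi^*$ can be cross-checked against the distribution of the sample maximum. The sketch below is an illustration under assumed values ($\theta_0 = 1$, $n = 5$, $\alpha = .05$ are hypothetical); it uses only the fact that $P[T \le t] = (t/\theta)^n$ for $0 < t < \theta$ when $T$ is the maximum of $n$ i.i.d. $U(0, \theta)$ observations.

```python
def cdf_max(t, theta, n):
    """P[T <= t] for T = max of n i.i.d. U(0, theta) observations."""
    if t <= 0:
        return 0.0
    if t >= theta:
        return 1.0
    return (t / theta) ** n

def power_from_rejection_region(theta, theta0, n, alpha):
    """P_theta[T < theta0 * alpha^(1/n)] + P_theta[T >= theta0]."""
    a = theta0 * alpha ** (1.0 / n)
    return cdf_max(a, theta, n) + (1.0 - cdf_max(theta0, theta, n))

def power_closed_form(theta, theta0, n, alpha):
    """Piecewise power function of phi* stated in Example 8.5.2."""
    a = theta0 * alpha ** (1.0 / n)
    if theta <= a:
        return 1.0
    if theta <= theta0:
        return (theta0 / theta) ** n * alpha
    return 1.0 - (theta0 / theta) ** n * (1.0 - alpha)

theta0, n, alpha = 1.0, 5, 0.05  # hypothetical settings
thetas = [0.3, 0.5, 0.7, 0.9, 1.0, 1.2, 2.0]
for theta in thetas:
    print(theta,
          round(power_from_rejection_region(theta, theta0, n, alpha), 6),
          round(power_closed_form(theta, theta0, n, alpha), 6))
```

The two columns agree, and the value at $\theta = \theta_0$ equals $\alpha$, so the test has size $\alpha$ while its power rises on both sides of $\theta_0$.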

When $H_0$ and $H_A$ are both composite hypotheses one can show that UMP tests generally do not exist, particularly when a nuisance parameter is present, i.e. when $\theta$ is vector valued and $H_0$ and $H_A$ are statements about, say, $\theta_1$ only. For example such a situation occurs in testing $H_0 : \theta = 0$ against $H_A : \theta > 0$ in the $N(\theta, \sigma^2)$ family where $\sigma^2$ is unknown. Here both $H_0$ and $H_A$ are composite; $\sigma^2$ is called a nuisance parameter and $\theta$ is called the parameter of interest. The reader familiar with standard statistical methods will observe that although a UMP test does not exist in this $N(\theta, \sigma^2)$ case, the well known Student's $t$ test, which rejects $H_0$ if $\sqrt{n}\,\bar{x}/\sqrt{S^2/(n-1)} > c$ where $S^2 = \sum (x_i - \bar{x})^2$, is used in many applications with $c = t_{n-1,1-\alpha}$, the $100(1 - \alpha)\%$ point of Student's $t$ with $n - 1$ degrees of freedom. Chapter 9 gives the above test and several other well known tests used in analysis of variance, regression analysis, testing equality of proportions of two samples from binomial distributions etc. using the method of Likelihood Ratio Test (LRT).


Tests of Hypotheses-II


9.1 The Likelihood Ratio Test (LRT) 

Consider the problem of testing $H_0 : \theta \in \Omega_0$ against $H_1 : \theta \in \Omega_1$. Following the likelihood principle, one would compare the maximum of the likelihood of $\theta$ for $\theta \in \Omega_0$ with the maximum of the likelihood of $\theta$ for $\theta \in \Omega_1$. We make this comparison in a slightly different way by defining the likelihood ratio test statistic (LRTS)

$$\lambda(x) = \sup_{\theta \in \Omega_0} L(x, \theta) \Big/ \sup_{\theta \in \Omega} L(x, \theta) \qquad (9.1.1)$$

where $\Omega = \Omega_0 \cup \Omega_1$.

Note that $0 \le \lambda(x) \le 1$ as $\max_{\theta \in \Omega} L(x, \theta) \ge \max_{\theta \in \Omega_0} L(x, \theta)$ since $\Omega_0 \subset \Omega$. The LRT based on $\lambda(x)$ follows the logic of likelihood ordering (Sec. 1.3) and assumes that small values of $\lambda(x)$ near zero indicate support to $H_1$. Thus the LRT rejects $H_0$ if $\lambda(x) < c$ where $c$ is chosen so that $\sup_{\theta \in \Omega_0} P[\lambda(x) < c \mid \theta] = \alpha$, where $\alpha$ is the specified upper limit to type I error or level of the test. We first give a few examples which show that the LRT based on $\lambda(x)$ leads to the MP or UMP level $\alpha$ tests derived in the previous chapter. We also observe that if $T$ is minimal sufficient for $\{L(x, \theta),\ \theta \in \Omega\}$ then we need to consider the LRT based on $\{g(t, \theta),\ \theta \in \Omega\}$ only, with LRTS

$$\lambda(t) = \sup_{\theta \in \Omega_0} g(t, \theta) \Big/ \sup_{\theta \in \Omega} g(t, \theta). \qquad (9.1.2)$$

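Example 9.1.1. We first consider the case where $\Omega_0 = \{\theta_0\}$ and $\Omega_1 = \{\theta_1\}$ and show that in this case the LRT leads to the MP level $\alpha$ test given by the N-P lemma. Note that $\sup_{\theta \in \Omega_0} L(x, \theta) = L(x, \theta_0)$ since $\Omega_0 = \{\theta_0\}$. Further

$$\sup_{\theta \in \Omega} L(x, \theta) = \begin{cases} L(x, \theta_0) & \text{if } L(x, \theta_0) \ge L(x, \theta_1) \\ L(x, \theta_1) & \text{if } L(x, \theta_0) < L(x, \theta_1). \end{cases}$$

Therefore the LRTS

$$\lambda(x) = \begin{cases} 1 & \text{if } L(x, \theta_0) \ge L(x, \theta_1) \\ \dfrac{L(x, \theta_0)}{L(x, \theta_1)} & \text{if } L(x, \theta_0) < L(x, \theta_1). \end{cases}$$

We reject $H_0$ if $\lambda(x) < c$ such that $P_{\theta_0}[\lambda(x) < c] = \alpha$. If $c = 0$ then we have $P_{\theta_0}[\lambda(x) \le 0] = P_{\theta_0}[\lambda(x) = 0] = 0$ and if $c = 1$ then $P_{\theta_0}[\lambda(x) \le 1] = 1$. As in most situations of interest we have $0 < \alpha < 1$, we take $0 < c < 1$. The LRT then rejects $H_0$ if $[\lambda(x) < c]$ with $c < 1$ and is therefore given by

$$\varphi(x) = \begin{cases} 1 & \text{if } L(x, \theta_0) < c L(x, \theta_1) \\ 0 & \text{otherwise} \end{cases}$$

where $c$ is such that $P_{\theta_0}[L(x, \theta_0) < c L(x, \theta_1)] = \alpha$. This is the same as the test given by the N-P lemma, namely

$$\varphi^*(x) = \begin{cases} 1 & \text{if } L(x, \theta_0) < c L(x, \theta_1) \\ \gamma & \text{if } L(x, \theta_0) = c L(x, \theta_1) \\ 0 & \text{otherwise} \end{cases}$$

where $(c, \gamma)$ are such that $\varphi^*$ is a size $\alpha$ test or $E_{\theta_0}[\varphi^*(x)] = \alpha$.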

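Example 9.1.2. Let $(X_1, \ldots, X_n)$ be a random sample of size $n$ from $N(\theta, 1)$. Suppose we want to test $H_0 : \theta \le \theta_0$ against $H_1 : \theta > \theta_0$ so that $\Omega = R_1$. Now $L(x, \theta) = g(\bar{x}, \theta) \cdot h(x)$ where $\bar{x} = \frac{1}{n} \sum x_i$ is the minimal sufficient statistic for the family $\{N(\theta, 1),\ \theta \in \Omega\}$ and as seen before $\hat{\theta} = \bar{x}$ and $\sup_{\theta \in \Omega} L(x, \theta) = g(\bar{x}, \bar{x}) h(x)$. For $\theta \in \Omega_0$, using the techniques of Example 7.2.2 we have $\hat{\theta}_0 = \bar{x}$ if $\bar{x} \le \theta_0$ and $\hat{\theta}_0 = \theta_0$ if $\bar{x} > \theta_0$. Therefore

$$\sup_{\theta \in \Omega_0} L(x, \theta) = \begin{cases} g(\bar{x}, \bar{x}) h(x) & \text{if } \bar{x} \le \theta_0 \\ g(\bar{x}, \theta_0) h(x) & \text{if } \bar{x} > \theta_0 \end{cases}$$

or

$$\lambda(x) = \begin{cases} 1 & \text{if } \bar{x} \le \theta_0 \\ \exp\left[-\dfrac{n(\bar{x} - \theta_0)^2}{2}\right] & \text{if } \bar{x} > \theta_0. \end{cases}$$

Thus we accept $H_0$ if $\bar{x} \le \theta_0$ and reject $H_0$ if $\bar{x} > \theta_0$ and $\exp\left[-\frac{n(\bar{x} - \theta_0)^2}{2}\right] < c$, which is equivalent to rejecting $H_0$ if $\bar{x} > \theta_0 + k$ where $k$ is chosen so that $\sup_{\theta \le \theta_0} P_\theta[\bar{X} > \theta_0 + k] = \alpha$. However $P_\theta[\bar{X} > \theta_0 + k] = 1 - \Phi[\sqrt{n}(\theta_0 - \theta + k)]$, which is an increasing function of $\theta \in (-\infty, \theta_0]$, and $\sup_{\theta \le \theta_0} P_\theta[\lambda(x) < c]$ therefore occurs at $\theta_0$ and is given by $1 - \Phi[\sqrt{n} k]$. Thus $k$ is determined by $1 - \Phi[\sqrt{n} k] = \alpha$ or $k = \xi_{1-\alpha}/\sqrt{n}$. The LRT is then given by

$$\varphi_1(x) = \begin{cases} 1 & \text{if } \bar{x} > \theta_0 + \dfrac{\xi_{1-\alpha}}{\sqrt{n}} \\ 0 & \text{otherwise.} \end{cases}$$

We have already seen in the previous chapter that $\varphi_1(x)$ defined above is in fact the UMP level $\alpha$ test for testing $H_0 : \theta \le \theta_0$ vs $H_1 : \theta > \theta_0$.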
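The one sided LRT of Example 9.1.2 can be sketched in a few lines. The code below is illustrative: the function names and the two small samples are hypothetical, and the threshold on $\lambda(x)$ is obtained by translating the cut-off $k = \xi_{1-\alpha}/\sqrt{n}$ on $\bar{x}$ back to the likelihood ratio scale.

```python
import math
from statistics import NormalDist, fmean

def lrt_statistic(xs, theta0):
    """lambda(x) for H0: theta <= theta0 vs H1: theta > theta0 in N(theta, 1)."""
    n, xbar = len(xs), fmean(xs)
    if xbar <= theta0:
        return 1.0
    return math.exp(-0.5 * n * (xbar - theta0) ** 2)

def lrt_rejects(xs, theta0, alpha):
    """Reject when lambda(x) < c, with c matched to xbar > theta0 + k."""
    n = len(xs)
    k = NormalDist().inv_cdf(1.0 - alpha) / math.sqrt(n)
    c = math.exp(-0.5 * n * k * k)
    return lrt_statistic(xs, theta0) < c

def size_at_boundary(n, alpha):
    """P[xbar > theta0 + k] at theta = theta0, which should equal alpha."""
    k = NormalDist().inv_cdf(1.0 - alpha) / math.sqrt(n)
    return 1.0 - NormalDist().cdf(math.sqrt(n) * k)

sample_far = [0.9, 1.4, 0.7, 1.1, 1.3]     # hypothetical data, mean well above 0
sample_near = [0.1, -0.2, 0.3, 0.0, 0.2]   # hypothetical data, mean close to 0
print(lrt_rejects(sample_far, 0.0, 0.05), lrt_rejects(sample_near, 0.0, 0.05))
print(size_at_boundary(5, 0.05))
```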


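Example 9.1.3. Suppose that in Example 9.1.2 we have $H_0 : \theta = \theta_0$ and $H_1 : \theta \ne \theta_0$; then

$$\lambda(x) = g(\bar{x}, \theta_0)/g(\bar{x}, \bar{x}) = \exp\left[-\frac{n(\bar{x} - \theta_0)^2}{2}\right].$$

Now $\lambda(x) < c$ is equivalent to $n(\bar{x} - \theta_0)^2 > k$ and we must select $k$ such that $P_{\theta_0}[n(\bar{X} - \theta_0)^2 \ge k] = \alpha$. But under $\theta_0$, $\bar{X} \sim N(\theta_0, \frac{1}{n})$ and therefore $n(\bar{X} - \theta_0)^2 \sim \chi^2_1$, and this gives $k = \chi^2_{1,1-\alpha} = \xi^2_{\alpha/2}$, where $\xi_\alpha$ is the $100\alpha\%$ point of $N(0, 1)$. Therefore the LRT rejects $H_0$ when $\bar{x} > \theta_0 + \dfrac{\xi_{1-\alpha/2}}{\sqrt{n}}$ or $\bar{x} < \theta_0 - \dfrac{\xi_{1-\alpha/2}}{\sqrt{n}} = \theta_0 + \dfrac{\xi_{\alpha/2}}{\sqrt{n}}$. This is the usual equal tail criterion test that we considered in Example 8.5.1 as a compromise test. Note that here a UMP level $\alpha$ test does not exist.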
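The identity $\chi^2_{1,1-\alpha} = \xi^2_{\alpha/2}$ used above can be checked numerically. This is a small sketch under one stated assumption: it relies on the standard fact that the $\chi^2_1$ distribution function can be written as $P[\chi^2_1 \le x] = \mathrm{erf}(\sqrt{x/2})$, which is not derived in the text.

```python
import math
from statistics import NormalDist

def chi2_1_cdf(x):
    """Distribution function of chi-square with 1 d.f., via the error function."""
    return math.erf(math.sqrt(x / 2.0))

alpha = 0.05
xi = NormalDist().inv_cdf(alpha / 2)   # 100(alpha/2)% point of N(0, 1), negative
k = xi ** 2                            # candidate value for chi^2_{1, 1-alpha}
print(k, chi2_1_cdf(k))
```

The printed probability should be $1 - \alpha$, confirming that squaring the two sided normal cut-off gives the $\chi^2_1$ cut-off.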

We have already observed that the power function of this test is symmetric around $\theta_0$ and that as $\theta \to \pm\infty$, power goes to 1. Further $\beta_\varphi(\theta) \ge \beta_\varphi(\theta_0) = \alpha$ and $\beta_\varphi(\theta)$ has its minimum at $\theta = \theta_0$. A test $\varphi$ such that

$$\sup_{\theta \in \Omega_0} \beta_\varphi(\theta) = \alpha \le \inf_{\theta \in \Omega_1} \beta_\varphi(\theta) \qquad (9.1.3)$$

has the property that its power is never below the level $\alpha$ which specifies the maximum possible type I error. A test having this property is called an unbiased test. Unbiasedness is a desirable property for any test since if a test is biased there exist an alternative $\theta_1 \in \Omega_1$ and a $\theta_0 \in \Omega_0$ such that $\beta_\varphi(\theta_0) > \beta_\varphi(\theta_1)$, or P [Rejecting $H_0$ | $H_0$ is true] is greater than P [Rejecting $H_0$ | $H_0$ is false]. One can show that the LRT derived above is the UMP test for $H_0 : \theta = \theta_0$ vs $H_1 : \theta \ne \theta_0$ within the class of all unbiased tests of level $\alpha$, i.e. within the sub-class of $D_\alpha$ given by $D_\alpha^0 = \{\varphi \mid \varphi \in D_\alpha,\ \beta_\varphi(\theta_0) \le \beta_\varphi(\theta)\}$. We will however not derive this result, which depends on the Generalized Neyman-Pearson Lemma. The interested reader should refer to Lehmann (1959) for further details.





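Example 9.1.4. Next consider the problem of testing composite null hypotheses against composite alternatives. Historically Student (1908) proposed the problem of testing $H_0 : \theta = \theta_0$ against $H_1 : \theta \ne \theta_0$ in the $N(\theta, \sigma^2)$ model where $\sigma^2$ is not specified under either $H_0$ or $H_1$. As mentioned at the end of Chapter 8, the methods books recommend the test based on the well known Student's statistic,

$$t_{n-1} = \frac{\sqrt{n}(\bar{x} - \theta_0)}{\sqrt{S^2/(n-1)}}, \quad \text{where } \bar{x} = \frac{1}{n} \sum x_i \text{ and } S^2 = \sum (x_i - \bar{x})^2,$$

which rejects $H_0$ for large values of $t^2_{n-1}$.

Since $t_{n-1} \in (-\infty, \infty)$ and its distribution under $H_0$ is symmetric around zero, Student's $t$-test rejects $H_0$ if the observed value of $t^2_{n-1} > t^2_{n-1,1-\alpha/2}$ where $t_{n-1,1-\alpha/2}$ is the $100(1 - \alpha/2)\%$ point of the Student's $t$ distribution with $(n - 1)$ d.f. We now show that the Student's $t$ test is indeed a LRT.

Now here

$$L(x, \theta, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left\{-\frac{\sum (x_i - \theta)^2}{2\sigma^2}\right\}$$

and $\hat{\theta} = \bar{x}$, $\hat{\sigma}^2 = S^2/n$ so that

$$\sup_{\theta \in R_1,\ \sigma^2 > 0} L(x, \theta, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\hat{\sigma}^2}}\right)^n \exp\left\{-\frac{n}{2}\right\}.$$

Now under $H_0$ we have

$$L_0(x, \theta_0, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp\left\{-\frac{\sum (x_i - \theta_0)^2}{2\sigma^2}\right\}, \qquad \hat{\sigma}_0^2 = \frac{1}{n} \sum (x_i - \theta_0)^2,$$

$$\sup_{\theta = \theta_0,\ \sigma^2 > 0} L(x, \theta, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\hat{\sigma}_0^2}}\right)^n \exp\left\{-\frac{n}{2}\right\}.$$

Thus $\lambda(x) = \left(\dfrac{\hat{\sigma}^2}{\hat{\sigma}_0^2}\right)^{n/2}$.

Now the LRT rejects $H_0$ if $\lambda(x) < c$, which is equivalent to $\hat{\sigma}^2/\hat{\sigma}_0^2 < c_1$. But $n\hat{\sigma}_0^2 = \sum (x_i - \theta_0)^2 = n\hat{\sigma}^2 + n(\bar{x} - \theta_0)^2$. Thus $\lambda(x) < c$ is equivalent to $\dfrac{n(\bar{x} - \theta_0)^2}{S^2} > k_2$, or to $t^2_{n-1} = \dfrac{n(n-1)(\bar{x} - \theta_0)^2}{S^2} > k_3$. Now $P[t^2_{n-1} \ge k_3 \mid \theta_0, \sigma^2]$ does not depend on $\sigma^2$ and therefore we have $k_3 = t^2_{n-1,1-\alpha/2}$, and Student's $t$ test and the LRT are equivalent.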
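The equivalence between $\lambda(x)$ and Student's statistic can be made explicit: from the relations above, $\lambda(x) = (1 + t^2_{n-1}/(n-1))^{-n/2}$, a decreasing function of $t^2_{n-1}$. The following sketch checks this numerically; the sample values, $\theta_0$ and the function name are hypothetical and serve only as an illustration.

```python
import math
from statistics import fmean

def t_and_lambda(xs, theta0):
    """Return Student's t statistic and the LRTS for H0: theta = theta0 in N(theta, sigma^2)."""
    n, xbar = len(xs), fmean(xs)
    S2 = sum((x - xbar) ** 2 for x in xs)               # S^2 = sum (x_i - xbar)^2
    t = math.sqrt(n) * (xbar - theta0) / math.sqrt(S2 / (n - 1))
    sigma2_hat = S2 / n                                 # MLE of sigma^2 under H0 or H1
    sigma2_hat0 = sum((x - theta0) ** 2 for x in xs) / n  # MLE of sigma^2 under H0
    lam = (sigma2_hat / sigma2_hat0) ** (n / 2)
    return t, lam

xs = [5.1, 4.7, 5.6, 5.3, 4.9, 5.8]   # hypothetical observations
t, lam = t_and_lambda(xs, theta0=5.0)
print(t, lam, (1 + t * t / (len(xs) - 1)) ** (-len(xs) / 2))
```

The last two printed numbers coincide, so rejecting for small $\lambda(x)$ is the same as rejecting for large $t^2_{n-1}$.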







Historically, whether the observed sample mean $\bar{x}$ is significantly different from the hypothetical population mean $\theta_0$ was tested by the absolute difference $|\bar{x} - \theta_0|$ measured in units of the standard deviation of $\bar{x}$, namely $\sigma/\sqrt{n}$. Thus for known $\sigma^2$ the test statistic was $|Z_n| = \dfrac{\sqrt{n}\,|\bar{x} - \theta_0|}{\sigma}$ and, using normality of $Z_n$, the null hypothesis was rejected if $|Z_n| > \xi_{1-\alpha/2}$ or $Z_n^2 > \xi^2_{1-\alpha/2} = \chi^2_{1,1-\alpha}$. When $\sigma^2$ is not known, it was recommended that we estimate $\sigma^2$ by $S^2/n$ and use $Z'_n = \dfrac{\sqrt{n}(\bar{x} - \theta_0)}{\sqrt{S^2/n}}$ and reject if $|Z'_n| > \xi_{1-\alpha/2}$, which is valid for large samples in view of the fact that $Z'_n \xrightarrow{d} N(0, 1)$. If $n$ is small, Student (1908) recommended using $t_{n-1} = \dfrac{\sqrt{n}(\bar{x} - \theta_0)}{\sqrt{S^2/(n-1)}}$ and obtained the exact sampling distribution of $t_{n-1}$ and its critical points. The most important property of Student's statistic $t_{n-1}$ is that its distribution under $H_0$ does not depend on the unknown nuisance parameter $\sigma^2$. This further implies that $P_{\theta_0}[t^2_{n-1} \ge t^2_{n-1,1-\alpha/2}] = \alpha$ for any $\sigma^2 > 0$, or if $\beta(\theta, \sigma^2)$ denotes the power function of the test, then $\beta(\theta_0, \sigma^2) = \alpha$, $\forall\ \sigma^2 > 0$. This property of the power function of a test is called similarity. The corresponding critical region is called a similar critical region and the corresponding test a similar test.


Remark 9.1.2. In general, suppose $H_0 : \theta \in \Omega_0$ is a composite hypothesis where $\theta = (\theta_1, \ldots, \theta_m)'$. Then a test $\varphi$ is called a similar test of size $\alpha$ if $\beta_\varphi(\theta) = \alpha$, $\forall\ \theta \in \Omega_0$. By using advanced techniques one can show that the Student's test based on $t_{n-1}$ is the UMP test within the class of all similar size $\alpha$ tests defined by $S_\alpha = \{\varphi \mid \beta_\varphi(\theta_0, \sigma^2) = \alpha\}$. We will not pursue the problem of obtaining UMP tests within the class of similar size $\alpha$ tests but refer to Lehmann (1959) for details.

We close this section with some additional examples to demonstrate how the approach based on the LRT leads to some of the well known tests that the reader may have come across in the methods course.


Example 9.1.5. Consider a situation where we have $k$ samples from $k$ normal populations with possibly different means $(\mu_1, \ldots, \mu_k)$ and common variance $\sigma^2$. Our object is to test the homogeneity of the $k$ samples, i.e. $H_0 : \mu_1 = \mu_2 = \cdots = \mu_k$, and $H_1$ is $\mu_i \ne \mu_j$ for at least one pair $(i, j)$. The common variance $\sigma^2$ is unspecified under $H_0$ as well as $H_1$.

We then have the following setup.

$$X_{ij} = \mu_i + \varepsilon_{ij}, \quad j = 1, 2, \ldots, n_i;\ i = 1, 2, \ldots, k$$

where $\{\varepsilon_{ij}\}$ are i.i.d. $N(0, \sigma^2)$. Here $\theta = (\mu_1, \mu_2, \ldots, \mu_k, \sigma^2)'$ and under $H_0$, $\theta = (\mu, \ldots, \mu, \sigma^2)'$ and $\Omega_0 \cup \Omega_1 = \Omega = R_k \times R_+$. The likelihood of the data $\{x_{ij},\ i = 1, 2, \ldots, k,\ j = 1, 2, \ldots, n_i\}$ is given by

$$L(x, \theta) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^N \exp\left\{-\sum_{i=1}^{k} \sum_{j=1}^{n_i} \frac{(x_{ij} - \mu_i)^2}{2\sigma^2}\right\}.$$

The MLEs are $\hat{\mu}_i = \bar{x}_i$ and $\hat{\sigma}^2 = \frac{1}{N} \sum \sum (x_{ij} - \bar{x}_i)^2$ where $N = \sum n_i$, so that

$$\sup_{\theta \in \Omega} L(x, \theta) = \left(\frac{1}{\sqrt{2\pi\hat{\sigma}^2}}\right)^N \exp\{-N/2\}.$$

The MLEs under $H_0$ are given by

$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{k} \sum_{j=1}^{n_i} x_{ij} = \bar{x}, \text{ the grand mean,}$$

$$\hat{\sigma}_0^2 = \frac{1}{N} \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2$$

so that

$$\sup_{\theta \in \Omega_0} L(x, \theta) = \left(\frac{1}{\sqrt{2\pi\hat{\sigma}_0^2}}\right)^N \exp\{-N/2\}.$$

The LRTS is $\lambda(x) = \left(\dfrac{\hat{\sigma}^2}{\hat{\sigma}_0^2}\right)^{N/2}$ and $\lambda(x) < c$ is equivalent to $\dfrac{\hat{\sigma}_0^2}{\hat{\sigma}^2} > c_1$.

We now use the well known decomposition of sums of squares (SS), namely

$$\sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2 = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 + \sum_{i=1}^{k} n_i(\bar{x}_i - \bar{x})^2$$

that is, Total SS = Within samples SS + Between samples SS, which gives $N\hat{\sigma}_0^2 = N\hat{\sigma}^2 + \sum n_i(\bar{x}_i - \bar{x})^2$. Therefore $\dfrac{\hat{\sigma}_0^2}{\hat{\sigma}^2} \ge c_1$ is equivalent to

$$\frac{\sum_i n_i(\bar{x}_i - \bar{x})^2}{\sum_i \sum_j (x_{ij} - \bar{x}_i)^2} \ge c_3.$$

Now under $H_0$, using Cochran's (1934) theorem one can show that the numerator is distributed as $\sigma^2 \chi^2_{k-1}$ and the denominator is distributed as $\sigma^2 \chi^2_{N-k}$ and they are independent. For the relevant distribution theory reference may be made to Rao (1973) and Kendall and Stuart Vol II (1967).

Thus the LRT in this case is equivalent to the well known $F$ test in the one way analysis of variance, given by $F_{k-1,N-k} = \dfrac{\chi^2_{k-1}}{k-1} \Big/ \dfrac{\chi^2_{N-k}}{N-k}$. Note that in this case $\Omega_0 = R_1 \times R_+$, $\Omega = \Omega_0 \cup \Omega_1 = R_k \times R_+$ and $H_0$ and $H_1$ are both composite. The $F$ test is also a similar test in that its power function under $H_0$ satisfies $\beta_\varphi(\mu, \sigma^2) = \alpha$, $\forall\ (\mu, \sigma^2) \in \Omega_0$. Again one can show that the LRT is a UMP level $\alpha$ test within the class of similar level $\alpha$ tests. For $k = 2$ the $F$ test is equivalent to the well known two sample $t$ test which rejects $H_0$ if

$$\frac{n_1 n_2}{n_1 + n_2} \cdot \frac{(\bar{x}_1 - \bar{x}_2)^2}{s^2} > t^2_{n_1+n_2-2,\,1-\alpha/2}$$

where $s^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$ is the pooled unbiased estimator of $\sigma^2$ and $s_1^2$ and $s_2^2$ are unbiased estimators of $\sigma^2$ from the two samples respectively.

So far we have been lucky in that the LRT led us to a test statistic whose distribution under $H_0$ is well known and even well tabulated for small samples as well. In the next section we will consider the famous Pearson $\chi^2$ test of goodness of fit and show that it is a large sample approximation to the LRT for this problem. We will then show how similar techniques could be used to obtain large sample tests in many other situations.
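The sums of squares decomposition and its link to the LRTS in Example 9.1.5 can be verified on a small data set. The sketch below is illustrative only: the three samples and the function name are hypothetical. It computes Total, Within and Between SS, the $F$ ratio, and $\lambda(x) = (\hat{\sigma}^2/\hat{\sigma}_0^2)^{N/2}$, and checks that $\lambda(x) = \left(1 + \frac{(k-1)F}{N-k}\right)^{-N/2}$, which follows from Total SS = Within SS + Between SS.

```python
from statistics import fmean

def one_way_anova(samples):
    """Return (total SS, within SS, between SS, F, LRTS) for k normal samples."""
    k = len(samples)
    N = sum(len(s) for s in samples)
    grand = sum(sum(s) for s in samples) / N           # grand mean
    within = sum(sum((x - fmean(s)) ** 2 for x in s) for s in samples)
    between = sum(len(s) * (fmean(s) - grand) ** 2 for s in samples)
    total = sum(sum((x - grand) ** 2 for x in s) for s in samples)
    F = (between / (k - 1)) / (within / (N - k))
    lam = (within / total) ** (N / 2)                   # (sigma_hat^2 / sigma_hat_0^2)^(N/2)
    return total, within, between, F, lam

samples = [                      # hypothetical data for k = 3 groups
    [12.1, 11.4, 13.0, 12.6],
    [10.2, 11.1, 10.8, 9.9, 10.5],
    [13.4, 12.9, 14.1],
]
total, within, between, F, lam = one_way_anova(samples)
print(total, within + between, F, lam)
```

Because $\lambda(x)$ is a decreasing function of $F$, rejecting for small $\lambda(x)$ and rejecting for large $F$ define the same critical region.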


Exercise 9.1. (i) Let $(X_1, \ldots, X_n)$ be i.i.d. exponential with mean $\theta$. Obtain the LRT for testing $H_0 : \theta \le 1$ against $H_1 : \theta > 1$ and show that it coincides with the UMP test for the problem. Obtain the LRT to test $H_0 : \theta = 1$ against $H_1 : \theta \ne 1$. Show that it is equivalent to the test which rejects $H_0$ if $T = \sum X_i \notin (a, b)$ where $(a, b)$ are such that $G_n(b) - G_n(a) = 1 - \alpha$. Show that if $a g_n(a) = b g_n(b)$, where $G_n$ and $g_n$ denote the d.f. and p.d.f. of a $G(n, 1)$ r.v., then the LRT is an unbiased test.

(ii) Show how the results of the above exercise can be used to test $H_0 : \lambda \ge 1$ vs $H_1 : \lambda < 1$ in the Pareto distribution with pdf $f(x, \lambda) = \dfrac{\lambda}{x^{\lambda+1}}$, $x > 1$, $\lambda > 0$.

(iii) Consider the simple regression problem

$$y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n$$

where $(x_1, \ldots, x_n)$ are fixed constants with $\sum (x_i - \bar{x})^2 > 0$. Obtain the LRT for testing $H_0 : \beta = 0$ vs $H_1 : \beta \ne 0$ when $\{\varepsilon_i\}_1^n$ are i.i.d. $N(0, \sigma^2)$. How will you modify the above LRT if $H_1$ is $\beta > 0$?

(iv) Let $(X_1, \ldots, X_{n_1})$ and $(Y_1, Y_2, \ldots, Y_{n_2})$ be two independent random samples from Poisson distributions with means $\lambda_1$ and $\lambda_2$ respectively. Obtain the LRT for testing $H_0 : \lambda_1 = \lambda_2$ against $H_1 : \lambda_1 \ne \lambda_2$ and also for $H_0 : \lambda_1 = \lambda_2$ against $H_1 : \lambda_1 > \lambda_2$. The two above LRTs are examples of similar tests.

(v) Refer to the Example given in Sec. 7.5 where $P(GG) = \theta^2$, $P(Gg) = \theta(1 - \theta) = P(gG)$ and $P(gg) = (1 - \theta)^2$. Obtain the LRT (i) for testing $\theta = \frac{1}{2}$ against $H_1 : \theta \ne 1/2$ and (ii) for testing $\theta = 3/4$ against $H_1 : \theta \ne 3/4$.

The first null hypothesis corresponds to equiprobable cells whereas the second null hypothesis corresponds to the Mendelian hypothesis, with frequencies proportional to (9 : 3 : 3 : 1).




9.2 Likelihood Ratio Tests for Multinomials 

Consider a situation when we have a random sample of size $n$ on a multinomial distribution in $k$ cells with observed cell frequencies $(n_1, n_2, \ldots, n_k)'$ such that $\sum n_i = n$. The well known test of goodness of fit due to K. Pearson (1900) uses the test statistic $\chi^2 = \sum_{i=1}^{k} (O_i - E_i)^2/E_i$ where $O_i$ are the observed cell frequencies and $E_i$ are the expected cell frequencies under $H_0 : p_i = p_{i0}$, $i = 1, 2, \ldots, k$. We reject $H_0$ if the observed value of $\chi^2$ exceeds the critical value $\chi^2_{k-1,1-\alpha}$, the $100(1 - \alpha)\%$ point of the chi-square distribution with $(k - 1)$ degrees of freedom (d.f.).

We show that the $\chi^2$ test is an approximation to the LRT for testing $H_0 : p_i = p_{i0}$, $i = 1, 2, \ldots, k$ against $H_1 : p_i \ne p_{i0}$ for at least one $i = 1, 2, \ldots, k$.

Since $H_0$ is a simple null hypothesis

$$\sup_{p \in \Omega_0} L(x, p) = L(x, p_0) = \frac{n!}{\prod_{i=1}^{k} n_i!} \prod_{i=1}^{k} (p_{i0})^{n_i}.$$

Now

$$L(x, p) = \frac{n!}{\prod_{i=1}^{k} n_i!} \prod_{i=1}^{k} p_i^{n_i} \quad \text{and} \quad \sup_{p \in \Omega} L(x, p) = L(x, \hat{p})$$

where $\hat{p}_i = \dfrac{n_i}{n}$. We leave it as an exercise to the reader to show that $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_k)'$ is in fact the MLE of $p = (p_1, \ldots, p_k)'$ where $p_i > 0$, $i = 1, 2, \ldots, k$, $\sum p_i = 1$. We have already seen in Example 6.3.3 that $(\hat{p}_1, \ldots, \hat{p}_{k-1})'$ is CAN for $(p_1, \ldots, p_{k-1})'$ with asymptotic variance covariance matrix $\frac{1}{n} \Lambda_{k-1}$ where

$$\Lambda_{ii} = p_i(1 - p_i) \quad \text{and} \quad \Lambda_{ij} = -p_i p_j.$$

The LRTS then is given by

$$\lambda(x) = \prod_{i=1}^{k} (p_{i0})^{n_i} \Big/ \prod_{i=1}^{k} (\hat{p}_i)^{n_i} \qquad (9.2.1)$$

or $\log \lambda(x) = \sum_{i=1}^{k} n_i \log \dfrac{p_{i0}}{\hat{p}_i}$. Using $n_i = n\hat{p}_i$ we have

$$-2 \log \lambda(x) = 2n \sum_{i=1}^{k} \hat{p}_i \log \frac{\hat{p}_i}{p_{i0}} \qquad (9.2.2)$$


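We now show that the RHS of (9.2.2) is equal to Pearson $\chi^2$ plus a remainder term which converges in probability to zero. Put

$$y_i = \sqrt{n}(\hat{p}_i - p_{i0}) \quad \text{or} \quad \hat{p}_i = \frac{y_i}{\sqrt{n}} + p_{i0}, \quad i = 1, 2, \ldots, k;$$

then

$$-2 \log \lambda(x) = 2n \sum_{i=1}^{k} \left(\frac{y_i}{\sqrt{n}} + p_{i0}\right) \log\left(1 + \frac{y_i}{p_{i0}\sqrt{n}}\right). \qquad (9.2.3)$$

Since $\hat{p}_i \xrightarrow{P} p_{i0}$ under $H_0$, $\dfrac{y_i}{p_{i0}\sqrt{n}} \xrightarrow{P} 0$ under $H_0$, and we can use the expansion of $\log(1 + x)$ for $|x| < 1$. Therefore

$$\log\left(1 + \frac{y_i}{p_{i0}\sqrt{n}}\right) = \frac{y_i}{p_{i0}\sqrt{n}} - \frac{y_i^2}{2 p_{i0}^2 n} + O_p(n^{-3/2}).$$

But $\sum y_i = 0$ and after summing over $i$ we obtain

$$-2 \log \lambda(x) = 2n \sum_{i=1}^{k} \left[\frac{y_i^2}{2n p_{i0}} + O_p(n^{-3/2})\right] = \sum_{i=1}^{k} \frac{(n\hat{p}_i - n p_{i0})^2}{n p_{i0}} + \varepsilon_n = \chi^2 + \varepsilon_n$$

where $\varepsilon_n \xrightarrow{P} 0$ under $H_0$. As seen in Example 6.3.3, Pearson $\chi^2$ has an asymptotic chi-square distribution with $(k - 1)$ d.f. We emphasize here that $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_{k-1})'$ is CAN for $p = (p_1, \ldots, p_{k-1})'$ with asymptotic variance covariance matrix

$$\frac{1}{n} \Lambda_{k-1} = \frac{1}{n} I_{k-1}^{-1}(p) \quad \text{where } \Lambda_{ii} = p_i(1 - p_i) \text{ and } \Lambda_{ij} = -p_i p_j,$$

and

$$\chi^2 = (\hat{p} - p_0)' \left[\frac{1}{n} \Lambda_{k-1}(p_0)\right]^{-1} (\hat{p} - p_0) = n(\hat{p} - p_0)' I_{k-1}(p_0)(\hat{p} - p_0).$$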
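The closeness of $-2 \log \lambda(x)$ to Pearson $\chi^2$ for a simple null hypothesis can be seen on a small numerical example. The sketch below is illustrative: the cell counts (600 hypothetical rolls of a die) and the function name are assumptions, and the null hypothesis is that all six cells are equiprobable.

```python
import math

def pearson_and_lrts(counts, p0):
    """Return (Pearson chi-square, -2 log lambda) for H0: p_i = p0_i."""
    n = sum(counts)
    pearson = sum((o - n * p) ** 2 / (n * p) for o, p in zip(counts, p0))
    # Cells with zero count contribute nothing to -2 log lambda.
    lrts = 2.0 * sum(o * math.log(o / (n * p)) for o, p in zip(counts, p0) if o > 0)
    return pearson, lrts

counts = [95, 108, 102, 91, 110, 94]   # hypothetical observed frequencies, n = 600
p0 = [1.0 / 6] * 6                     # equiprobable cells under H0
pearson, lrts = pearson_and_lrts(counts, p0)
print(pearson, lrts)
```

Both statistics would be referred to the chi-square distribution with $k - 1 = 5$ d.f.; their difference is the remainder $\varepsilon_n$ of the expansion above.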

Now consider a situation in which $H_0$ can be composite. For example such a situation arises in a $2 \times 2$ contingency table or in general an $r \times s$ contingency table where $H_0$ is the hypothesis of row and column independence.

Let $A_1, \ldots, A_r$ be a partition of the sample space according to a factor $A$ at $r$ levels and let $B_1, \ldots, B_s$ be a partition of the same sample space according to a factor $B$ at $s$ levels. Then $C_{ij} = A_i \cap B_j$, $i = 1, 2, \ldots, r$, $j = 1, \ldots, s$ is a partition of the sample space; let $p_{ij}$ denote the cell probabilities and $n_{ij}$ the observed frequencies in a random sample of size $n$ on the corresponding multinomial. Let $p_{i.} = \sum_j p_{ij} = P(A_i)$ and $p_{.j} = \sum_i p_{ij} = P(B_j)$. Then the hypothesis of independence of the classification factors $A$ and $B$ at all levels corresponds to the relations $p_{ij} = P(A_i \cap B_j) = P(A_i) P(B_j) = p_{i.} \times p_{.j}$, $i = 1, 2, \ldots, r$, $j = 1, 2, \ldots, s$.

Now, the likelihood of the sample when $p_{ij}$ are such that $\sum \sum p_{ij} = 1$, $p_{ij} > 0$ is given by

$$L(n, p) = \frac{n!}{\prod_{i,j} n_{ij}!} \prod_{i,j} (p_{ij})^{n_{ij}} \qquad (9.2.4)$$

and that under the null hypothesis of independence is given by

$$L_0(n, p) = \frac{n!}{\prod_{i,j} n_{ij}!} \prod_{i,j} (p_{i.} p_{.j})^{n_{ij}} \qquad (9.2.5)$$

The MLEs are $\hat{p}_{ij} = \dfrac{n_{ij}}{n}$ and $\sup_{p \in \Omega} \log L(n, p) = \text{const.} + \sum_i \sum_j n_{ij} \log \hat{p}_{ij}$. Similarly the MLEs of $p_{i.}$ and $p_{.j}$ are $\hat{p}_{i.} = \dfrac{n_{i.}}{n} = \sum_j n_{ij}/n$ and $\hat{p}_{.j} = \dfrac{n_{.j}}{n} = \sum_i n_{ij}/n$, which gives

$$\sup_{p \in \Omega_0} \log L_0(n, p) = \text{const.} + \sum_i n_{i.} \log \hat{p}_{i.} + \sum_j n_{.j} \log \hat{p}_{.j}.$$

The logarithm of the LRTS is

$$\log \lambda(n) = \sum_i n_{i.} \log \hat{p}_{i.} + \sum_j n_{.j} \log \hat{p}_{.j} - \sum_i \sum_j n_{ij} \log \hat{p}_{ij}$$

which can be simplified to

$$-2 \log \lambda(n) = 2n \sum_i \sum_j \hat{p}_{ij} (\log \hat{p}_{ij} - \log \hat{p}_{i.} \hat{p}_{.j}) \qquad (9.2.6)$$

Now following the techniques used in the derivation of Pearson $\chi^2$ as an approximation to the LRTS, we can show that

$$-2 \log \lambda(n) = \sum_{i=1}^{r} \sum_{j=1}^{s} \frac{\left(n_{ij} - \dfrac{n_{i.} n_{.j}}{n}\right)^2}{\dfrac{n_{i.} n_{.j}}{n}} + O(1/n).$$

However we will consider a more basic technique. Consider the Taylor series expansion of $\log b$ about $\log a$, which is given by

$$\log b = \log a + (b - a) \frac{1}{a} - \frac{1}{2} (b - a)^2 \frac{1}{a^2} + O(|b - a|^3)$$

or

$$a(\log a - \log b) = -(b - a) + \frac{1}{2} (b - a)^2 \frac{1}{a} + O(|b - a|^3).$$

Take $a = \hat{p}_{ij}$ and $b = \hat{p}_{i.} \hat{p}_{.j}$ and expand each term on the RHS of (9.2.6). Then as $\sum \sum \hat{p}_{ij} = 1 = \sum \hat{p}_{i.} = \sum \hat{p}_{.j}$ we have RHS of (9.2.6)




$$= \sum_i \sum_j \frac{\left(n_{ij} - \dfrac{n_{i.} n_{.j}}{n}\right)^2}{\dfrac{n_{i.} n_{.j}}{n}} + \varepsilon_n$$

where $\varepsilon_n \xrightarrow{P} 0$ under $H_0$. The reader will recall that the methods course states that $\sum_{i,j} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$ has a $\chi^2$ distribution with $(r - 1)(s - 1)$ d.f. and not $(rs - 1)$ d.f. In estimating $p_{i.}$ and $p_{.j}$ under $H_0$ we are estimating $(r - 1) + (s - 1)$ parameters, whereas under $H_0 \cup H_1$ we are estimating $(rs - 1)$ parameters. As a consequence of the Pearson-Fisher Theorem quoted below, the Pearson $\chi^2$ has d.f. given by $(rs - 1) - (r - 1) - (s - 1) = (r - 1)(s - 1)$. Originally Pearson (1900) indicated that for testing independence in a $2 \times 2$ table the number of d.f. for $\chi^2$ is $4 - 1 = 3$. However as pointed out by Yule and Greenwood (1915) and Bowley (1920) this led to certain inconsistencies. Fisher (19 ) indicated that the correct number of d.f. in this case is not three but $1 = (2 - 1)(2 - 1)$. Nobody at that time thought that K. Pearson could be wrong and it appears that Fisher had difficulty in publishing this result. However Fisher (1928) showed that if the multinomial distribution in $k$ cells has probabilities $p_i(\theta)$ depending on a vector parameter $\theta = (\theta_1, \ldots, \theta_m)'$ then

$$\chi^2 = \sum_{i=1}^{k} \frac{(n_i - n p_i(\hat{\theta}))^2}{n p_i(\hat{\theta})}$$

has a $\chi^2$ distribution with $(k - 1) - m$ d.f. when $\hat{\theta}$ is the MLE of $\theta$. Thus the d.f. $(k - 1)$ for the simple $H_0 : p_i = p_{i0}$, $i = 1, 2, \ldots, k$ are reduced by $m$, the number of parameters estimated under $H_0$. This is the well known Pearson-Fisher theorem. There were a few gaps in the proof given by Fisher but these were filled by Cramer (1946) and Birch (1964). We will not go into details here, but strongly recommend that readers refer to the original papers of Fisher, particularly the Author's notes written by Fisher himself in Shewhart (1950), where references to Yule and Greenwood (1915) and Bowley (1920) are listed fully.
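The test of independence in an $r \times s$ table discussed above can be sketched as follows. This is an illustration only: the $2 \times 3$ table and the function name are hypothetical. Expected frequencies are $E_{ij} = n_{i.} n_{.j}/n$, and both the Pearson statistic and $-2 \log \lambda(n)$ from (9.2.6) are computed, together with the $(r - 1)(s - 1)$ degrees of freedom.

```python
import math

def independence_statistics(table):
    """Return (Pearson chi-square, -2 log lambda, d.f.) for an r x s table of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]            # n_i.
    col_totals = [sum(col) for col in zip(*table)]      # n_.j
    pearson, lrts = 0.0, 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            pearson += (observed - expected) ** 2 / expected
            if observed > 0:                            # empty cells contribute zero
                lrts += 2.0 * observed * math.log(observed / expected)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return pearson, lrts, df

table = [[30, 20, 10],    # hypothetical counts, factor A at 2 levels,
         [20, 30, 40]]    # factor B at 3 levels
pearson, lrts, df = independence_statistics(table)
print(pearson, lrts, df)
```

The count of $(r - 1)(s - 1)$ d.f. reflects the $(r - 1) + (s - 1)$ marginal probabilities estimated under $H_0$, as in the Pearson-Fisher theorem.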

We now consider an example to illustrate the Pearson-Fisher Theorem. 


Example 9.2.1. Consider Example 7.5.2 due to Fisher (1954) which considers a multinomial distribution in 4 cells with cell probabilities

$$p_1(\theta) = \frac{2 + \theta}{4}, \quad p_2(\theta) = p_3(\theta) = \frac{1 - \theta}{4} \quad \text{and} \quad p_4(\theta) = \theta/4, \quad 0 < \theta < 1$$

where $\theta$ denotes the linkage factor between two varieties of maize classified as Sugary and Starchy with colors Green and White. In Example 7.5.2 for Carver's data we have seen that the MLE is $\hat{\theta} = .035712$ and therefore the expected frequencies under the null hypothesis that the model is true are given by

$$n\left(\frac{2 + \hat{\theta}}{4}\right) = 1943.60, \quad n\left(\frac{1 - \hat{\theta}}{4}\right) = n\left(\frac{1 - \hat{\theta}}{4}\right) = 920.65, \quad \frac{n\hat{\theta}}{4} = 34.10$$

where $n = 3819$.

On the other hand $n\hat{p}_1 = 1977$, $n\hat{p}_2 = 906$, $n\hat{p}_3 = 904$ and $n\hat{p}_4 = 32$. This gives

$$\sup_{p \in \Omega} L(n, p) = \frac{n!}{\prod n_i!} \prod_{i=1}^{4} (\hat{p}_i)^{n_i}$$

and

$$\sup_{\theta \in \Omega_0} L(n, p(\theta)) = \frac{n!}{\prod n_i!} \prod_{i=1}^{4} (p_i(\hat{\theta}))^{n_i}$$

so that

$$-2 \log \lambda(n) = 2n \sum_{i=1}^{4} \hat{p}_i \log \frac{\hat{p}_i}{p_i(\hat{\theta})} = 2 \sum_{i=1}^{4} n_i \log \frac{n_i}{n p_i(\hat{\theta})} = .6660$$

On the other hand Pearson $\chi^2 = \sum_{i=1}^{4} (O_i - E_i)^2/E_i = 1.2375$. We observe that even for $n = 3819$, a fairly large sample size, the approximation to $-2 \log \lambda(n)$ by Pearson $\chi^2$ is not very close.
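The two statistics of Example 9.2.1 can be recomputed directly from the quantities stated in the text. The sketch below takes the observed frequencies and $\hat{\theta} = .035712$ from the example; the variable names are illustrative. In a run of this sketch the likelihood ratio statistic should come out close to the Pearson value, as the expansion earlier in this section suggests, so the printed figure .6660 may deserve rechecking against the original source.

```python
import math

observed = [1977, 906, 904, 32]     # Carver's data as quoted in Example 9.2.1
theta_hat = 0.035712                # MLE quoted from Example 7.5.2
n = sum(observed)

cell_probs = [(2 + theta_hat) / 4, (1 - theta_hat) / 4,
              (1 - theta_hat) / 4, theta_hat / 4]
expected = [n * p for p in cell_probs]

# Pearson chi-square and -2 log lambda with the fitted cell probabilities.
pearson = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
lrts = 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected))

print([round(e, 2) for e in expected])
print(round(pearson, 4), round(lrts, 4))
```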

Since $k = 4$ and $m = 1$, the Pearson $\chi^2$ test statistic has 2 d.f. Suppose $\alpha = .05$; then we reject the model specified by $H_0$ if the observed value is greater than the critical value $\chi^2_{2,1-\alpha} = 5.99147$. In this case therefore we do not reject $H_0$ either on the basis of Pearson $\chi^2$ or the LRTS.

If the Pearson $\chi^2$ test statistic is regarded as a measure of discrepancy between the observed multinomial distribution and the hypothetical multinomial, then the probability of obtaining a value of Pearson $\chi^2$ as large as or larger than the observed value can be regarded as a measure of goodness of fit of the model with the observed data. The nearer this value is to zero, the more strongly we should suspect $H_0$, and the nearer this value is to one, the stronger is the support in favour of the model specified by $H_0$. For Carver's data the observed Pearson $\chi^2 = 1.2375$ and $P[\chi^2_2 \ge 1.2375] = \int_{1.2375}^{\infty} f_2(u)\, du$ where $f_2(u)$ is the pdf of $\chi^2_2$, which is the same as that of an exponential distribution with mean 2. Thus the above probability is given by $\frac{1}{2} \int_{1.2375}^{\infty} e^{-u/2}\, du = .5386$.

From Abramowitz and Stegun (1964) (Table 26.7) by interpolation one obtains the value .5388, which is quite close to the exact value. The above probability is called the observed level of significance of the Pearson $\chi^2$ statistic and in the general case it would be

$$P[\chi^2_{k-1-m} \ge \text{observed value of Pearson } \chi^2].$$

If this probability is less than $\alpha$, the specified level of significance or maximum tolerable type I error, then we reject $H_0$; otherwise we accept $H_0$.

The Pearson-Fisher theorem can also be used to test the model prescribed by a continuous distribution. This is achieved by grouping the data into $k$ cells $C_1, \ldots, C_k$ with cell probabilities $p_i(\theta) = \int_{C_i} f(x, \theta)\, dx$. If the Pearson $\chi^2$ test statistic rejects $H_0$, i.e. $p_i = p_i(\theta)$, $i = 1, 2, \ldots, k$, then this implies rejection of the model specified by the class of pdf given by $\{f(x, \theta),\ \theta \in \Omega\}$. Note that here both $x$ and $\theta$ can be vector valued. On the other hand if we have a discrete random variable with $P[X = x_i] = p_i(\theta)$, $i = 1, 2, \ldots, k$ then each cell $C_i$ can be taken as $\{x_i\}$ or we can still have grouped data. We illustrate this technique by examples.
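The observed level of significance for Carver's data, and the 2 d.f. critical value, follow from the fact used above that $\chi^2_2$ is an exponential distribution with mean 2, so that $P[\chi^2_2 \ge x] = e^{-x/2}$. A short sketch (the function names are illustrative):

```python
import math

def chi2_2_upper_tail(x):
    """P[chi-square with 2 d.f. >= x], using the exponential form with mean 2."""
    return math.exp(-x / 2.0)

def chi2_2_critical_value(alpha):
    """100(1 - alpha)% point of chi-square with 2 d.f., from exp(-x/2) = alpha."""
    return -2.0 * math.log(alpha)

p_value = chi2_2_upper_tail(1.2375)
critical = chi2_2_critical_value(0.05)
print(round(p_value, 4), round(critical, 5))
```

This closed form is special to 2 d.f.; for other degrees of freedom tables or a numerical routine would be needed.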


t?n AM /wLn\ Consider the data obtained by Rutherford Chadwick and 
is ( ) reported earlier in Chapter 1. Suppose we want to test the 

ypot eses that the model is described by the Poisson pmf with parameter 
A > 0 unspecified. Following Cramer (1946) (Table 30.4.1) we obtain: 


X 


nPid) 

0 

57 

54.399 

l 

203 

210.523 

2 

383 

407.361 

3 

525 

525.496 

4 

532 

508.418 

5 

408 

393.575 

6 

273 

253.817 

7 

139 

140.325 

8 

45 

67.882 

9 

27 

29.189 

> 10 

16 

17.075 


The MLE is $\hat\lambda = 3.87$ and Pearson $\chi^2 = 12.885$. The d.f. of Pearson $\chi^2$ are
11 - 1 - 1 = 9 since under $H_0$ only one parameter is estimated. The critical
value of $\chi^2_9$ at $\alpha = .05$ is 16.919 and we accept $H_0$ or say that the Poisson
distribution model holds. The observed level of significance is obtained as

p value = $P[\chi^2_9 \ge 12.885] = .17$

Thus although the data supports the hypothesis of Poisson distribution this
support is not as strong as that for $H_0$ in the following example.
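The computation in this example can be checked with a short script. The following is only a sketch: the function names `pearson_chi2` and `chi2_sf` are illustrative, and the chi-square tail probability is obtained from a series for the incomplete gamma function rather than from tables. Because the expected frequencies are entered as rounded in the table, the statistic agrees with the quoted 12.885 only to about two decimal places.

```python
import math

def pearson_chi2(observed, expected):
    """Pearson statistic: sum over cells of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_sf(x, df):
    """P[chi-square with df d.f. > x], using the series expansion of the
    regularized lower incomplete gamma function with a = df/2, z = x/2."""
    a, z = df / 2.0, x / 2.0
    if z <= 0.0:
        return 1.0
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= z / (a + n)
        total += term
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return 1.0 - lower

# Observed and fitted Poisson frequencies from the table above.
observed = [57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 16]
expected = [54.399, 210.523, 407.361, 525.496, 508.418, 393.575,
            253.817, 140.325, 67.882, 29.189, 17.075]

stat = pearson_chi2(observed, expected)
df = len(observed) - 1 - 1        # cells - 1 - number of estimated parameters
p_value = chi2_sf(stat, df)
print(f"Pearson chi-square = {stat:.3f}, d.f. = {df}, p-value = {p_value:.3f}")
```

The same two functions apply to any grouped data once the fitted cell frequencies are available; only the d.f. correction changes with the number of estimated parameters.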

Example 9.2.3. In his famous novel “Jurassic Park” Michael Crichton 
(1990) presents a statistical analysis of the data on the heights in cms of 68 



218 A First Course on Parametric Inference 

procompsognathids bred in the Park and released in three batches at six
month intervals. The statistical analysis done by Malcolm, the information
scientist of the team, shows that the table presented below has a frequency
polygon which very closely resembles a perfectly symmetric normal distribution.
Malcolm argues that this resemblance constitutes a proof that the dinosaurs
were breeding at random because only then will the distribution of their heights
produce a normal curve. On the other hand, since three batches were
released at six month intervals, the data on heights should look more like a
mixture of three different distributions if there was no random breeding. We
will examine this analysis using the Pearson-Fisher theorem and application of
Pearson $\chi^2$ to grouped data for testing $H_0$ : Data is a random sample from
$N(\mu, \sigma^2)$, both $(\mu, \sigma^2)$ unspecified.


Table 9.2.1. Height of 68 “Compsos” (cm)

 Height     $O_i$     $n p_i(\hat\mu, \hat\sigma^2) = E_i$
 < 29         4         5.91
 < 30         4         4.39
 < 31         6         5.95
 < 32         9         7.64
 < 33        10         8.71
 < 34        10         8.85
 < 35         9         9.08
 < 36         6         6.44
 < 37         4         4.92
 ≥ 37         6         6.06
 Total       68        68.00


The MLEs are $\hat\mu = 33.15$ and $\hat\sigma^2 = 3.05$. Pearson $\chi^2 = 1.4184$ and is to be
tested at 10 - 1 - 2 = 7 d.f. as two parameters $\mu$ and $\sigma^2$ are estimated. Now
$\chi^2_{7,.95} = 14.067$ and we do not reject $H_0$. The p-value of the observed $\chi^2$ is
given by $P[\chi^2_7 \ge 1.4184] = .9824$ which very strongly supports the hypothesis
of normality. Hence the incontrovertible conclusion is that dinosaurs are
breeding at random. This later turns out to be the fact and there is a plausible
explanation of how this could happen due to certain possible properties of
DNA sequences used in breeding “Compsos” in the laboratory. The data is
fictitious but the ingenuity of the novelist and his use of statistical thinking
is very noteworthy.

We refer to Cramer (1946, Sec. 30.2) and Fisz (1963, Example 12.4.2) 
and Rao (1973, Sec. 6b) for some real life data (grouped) to test normality 
by using Pearson statistic and Pearson-Fisher Theorem. 

Exercise 9.2. (i) As per the Mendelian hypothesis $H_0 : p_1 = 9/16$, $p_2 = 3/16$, $p_3 = 3/16$
and $p_4 = 1/16$ where the four cells correspond to the 2 x 2 classification of peas according
to (Round and Angular) x (Yellow and Green). For the data $n_1 = 315$, $n_2 = 108$, $n_3 = 101$
and $n_4 = 32$ show that the p-level is between 90% and 95%. The Mendelian hypothesis is
strongly supported by the data. Should one then suspect the data as
the fit is 'too good to be true'? Thus for the $\chi^2$ test, should one suspect very small values of $\chi^2$
also? The common view is to believe in the honesty of the concerned experimenters and
reject the model specified by $H_0$ only for large values of Pearson $\chi^2$.

(ii) Consider the multinomial distribution in 4 cells with cell probabilities $p_1(\theta)$,
$p_2(\theta) = p_3(\theta)$ and $p_4(\theta)$ given in Section 7.5. Obtain the Pearson $\chi^2$ test for this model.
Suppose for a sample of 1000 observations the observed cell frequencies are 558, 192, 178,
72. What is the p-level in support of $H_0$ that the model is true?

(iii) Consider a 2 x 2 table discussed by Yule and Greenwood (1915) to study the
effect of antityphoid and anticholera inoculations on incidence of these diseases. The data
is taken from Kendall and Stuart, Vol. 2 [(1967) Example 33.1].

                    Not attacked    Attacked    Total
 Inoculated              276            3        279
 Not inoculated          473           66        539
 Total                   749           69        818

Calculate Pearson $\chi^2$ and test (at $\alpha = .05$) the hypothesis of independence between the two
attributes, namely being inoculated and not attacked. What is the p-level of the observed
Pearson $\chi^2$?


9.3 Large Sample Tests 

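Consider testing $H_0 : \theta = \theta_0$ against $H_1 : \theta \ne \theta_0$ on the basis of a random
sample of size $n$ with pdf belonging to the one parameter Cramér family given
by $\{f(x, \theta), \theta \in \Omega \subset R_1\}$. Then

$$-2 \log \lambda(x) = 2[\log L(x, \hat\theta) - \log L(x, \theta_0)].$$

Now expand $\log L(x, \theta_0)$ around $\hat\theta$ by Taylor series to obtain

$$\log L(x, \theta_0) = \log L(x, \hat\theta) + (\theta_0 - \hat\theta)\left(\frac{\partial \log L}{\partial \theta}\right)_{\theta=\hat\theta} + \frac{(\theta_0 - \hat\theta)^2}{2}\left(\frac{\partial^2 \log L}{\partial \theta^2}\right)_{\theta=\hat\theta} + \varepsilon_n$$

where $\varepsilon_n = O(|\theta_0 - \hat\theta|^{2+\delta})$. Noting that $\left(\frac{\partial \log L}{\partial \theta}\right)_{\theta=\hat\theta} = 0$ we have

$$-2 \log \lambda(x) = -(\theta_0 - \hat\theta)^2\left(\frac{\partial^2 \log L}{\partial \theta^2}\right)_{\theta=\hat\theta} + \varepsilon_n
= nI(\theta_0)(\theta_0 - \hat\theta)^2 \cdot \left[-\frac{1}{nI(\theta_0)}\left(\frac{\partial^2 \log L}{\partial \theta^2}\right)_{\theta=\hat\theta}\right] + \varepsilon_n
= UV + \varepsilon_n$$

where $U = nI(\theta_0)(\hat\theta - \theta_0)^2 \xrightarrow{L} \chi^2_1$ and $V = -\frac{1}{nI(\theta_0)}\left(\frac{\partial^2 \log L}{\partial \theta^2}\right)_{\theta=\hat\theta} \xrightarrow{P} 1$ in
view of the Cramér-Huzurbazar Theorem, and $\varepsilon_n \xrightarrow{P} 0$.

Thus $-2 \log \lambda \xrightarrow{L} \chi^2_1$ under $H_0$. Therefore, the LRT rejects $H_0$ if $-2 \log \lambda(x) >
\chi^2_{1,1-\alpha} = \xi^2_{1-\alpha/2}$ where $\xi_\alpha$ is the $100\alpha\%$ point of the standard normal distribution.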
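The result can be generalized to the $m$-parameter Cramér family to show that

$$-2 \log \lambda(x) = U_m V_m + \varepsilon_n$$

where

$$U_m = n(\hat\theta - \theta_0)' I(\theta_0)(\hat\theta - \theta_0) \qquad (9.3.1)$$

and

$$V_m = I^{-1}(\theta_0)\left(\left(-\frac{1}{n}\frac{\partial^2 \log L}{\partial \theta_r\,\partial \theta_s}\right)\right)_{\theta=\hat\theta}.$$

As $\sqrt{n}(\hat\theta - \theta_0) \xrightarrow{L} N^{(m)}(0, I^{-1}(\theta_0))$ we have $U_m \xrightarrow{L} \chi^2_m$. Further $V_m \xrightarrow{P} I_{m \times m}$,
the identity matrix of dimension $m$. Thus $-2 \log \lambda(x) \xrightarrow{L} \chi^2_m$ and we reject
$H_0 : \theta = \theta_0$ if $-2 \log \lambda(x) > \chi^2_{m,1-\alpha}$. The approximate p-level is given by
$P[\chi^2_m > -2 \log \lambda(x)]$.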

A large sample test equivalent to the LRT derived above can be obtained
by using the fact that under $H_0$, $\hat\theta \sim AN^{(m)}\left(\theta_0, \frac{1}{n} I^{-1}(\theta_0)\right)$. Thus if the observed
deviation between $\hat\theta$ and the hypothesized value $\theta_0$ is large we have an occurrence
of a rare event and we reject $H_0$. The deviation between $\hat\theta$ and $\theta_0$ is however
not measured by the usual norms $\sum_i (\hat\theta_i - \theta_{i0})^2$ or $\max_i |\hat\theta_i - \theta_{i0}|$ but by the
norm defined by

$$U_m = n(\hat\theta - \theta_0)' I(\theta_0)(\hat\theta - \theta_0).$$

Wald (1943) proposed that instead of using $I(\theta_0)$ one could use its estimator
given by $I(\hat\theta)$, i.e. approximate $-2 \log \lambda(x)$ by $n(\hat\theta - \theta_0)' I(\hat\theta)(\hat\theta - \theta_0)$, and
we reject $H_0$ if $n(\hat\theta - \theta_0)' I(\hat\theta)(\hat\theta - \theta_0) > \chi^2_{m,1-\alpha}$. Here $\hat\theta$ is the MLE of $\theta$
under $H_0 \cup H_1$.

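The LRT as well as Wald's test requires explicit evaluation of the MLE $\hat\theta$ and
the information matrix at $\theta_0$ or $\hat\theta$. One can obtain a large sample test based
on the score functions $\left(\frac{\partial \log L}{\partial \theta_1}, \ldots, \frac{\partial \log L}{\partial \theta_m}\right)$, using the fact
that $\frac{1}{\sqrt{n}}\left(\frac{\partial \log L}{\partial \theta}\right)_{\theta=\theta_0}$ is $AN^{(m)}(0, I(\theta_0))$ under $H_0$. Thus

$$Q(\theta_0) = \frac{1}{n}\left(\frac{\partial \log L}{\partial \theta}\right)'_{\theta=\theta_0} I^{-1}(\theta_0)\left(\frac{\partial \log L}{\partial \theta}\right)_{\theta=\theta_0} \qquad (9.3.2)$$

has asymptotic $\chi^2_m$ distribution, and we reject $H_0$ if $Q(\theta_0) > \chi^2_{m,1-\alpha}$. This is
known as a score test and was proposed by Rao (1947). In the score test one
can use $\left(\left(-\frac{1}{n}\frac{\partial^2 \log L}{\partial \theta_r\,\partial \theta_s}\right)\right)^{-1}_{\theta=\theta_0}$ instead of $I^{-1}(\theta_0)$ to obtain

$$Q_1(\theta_0) = \left(\frac{\partial \log L}{\partial \theta}\right)'_{\theta=\theta_0}\left(\left(-\frac{\partial^2 \log L}{\partial \theta_r\,\partial \theta_s}\right)\right)^{-1}_{\theta=\theta_0}\left(\frac{\partial \log L}{\partial \theta}\right)_{\theta=\theta_0} \qquad (9.3.3)$$

The score test given by $Q_1(\theta_0)$ does not require explicit evaluation of $\hat\theta$
and $I^{-1}(\theta_0)$. The asymptotic distribution of the test statistic $Q(\theta_0)$ or $Q_1(\theta_0)$
is still $\chi^2_m$ under $H_0$ and we reject $H_0$ if the observed value of the test statistic
exceeds $\chi^2_{m,1-\alpha}$.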

Example 9.3.1. Let $(x_1, \ldots, x_n)$ be a random sample of size $n$ from the Cauchy
distribution with location $\mu$. Then

$$\log L = -n \log \pi - \sum_{i=1}^{n} \log[1 + (x_i - \mu)^2]$$

$$\frac{\partial \log L}{\partial \mu} = \sum_{i=1}^{n} \frac{2(x_i - \mu)}{1 + (x_i - \mu)^2}$$

We also know that $I(\mu) = \frac{1}{2}$. Suppose we want to test $H_0 : \mu = 0$ against
$H_1 : \mu \ne 0$; then $-2 \log \lambda$, the LRT statistic, is approximated by $\frac{n}{2}(\hat\mu)^2$ where
$\hat\mu$ is the MLE of $\mu$ and we reject $H_0$ if $\frac{n}{2}(\hat\mu)^2 > \chi^2_{1,1-\alpha}$. This requires
explicit evaluation of $\hat\mu$ which depends on the method of scoring for the
parameter given in (7.6.3). If instead we use (9.3.2) we reject $H_0$ if

$$\frac{2}{n}\left(\frac{\partial \log L}{\partial \mu}\right)^2_{\mu=0} > \chi^2_{1,1-\alpha}$$

which avoids explicit evaluation of $\hat\mu$. Since $I(\mu) = \frac{1}{2}$ we do not recommend using $Q_1(\theta_0)$ given
by (9.3.3) since it requires evaluating

$$\left(-\frac{\partial^2 \log L}{\partial \mu^2}\right)^{-1}_{\mu=0} = \left[\sum_{i=1}^{n} \frac{2(1 - x_i^2)}{(1 + x_i^2)^2}\right]^{-1}.$$
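The score test of Example 9.3.1 needs only the score at the hypothesized value and the known information $I(\mu) = 1/2$. The sketch below evaluates the statistic based on (9.3.2) for the Cauchy location model; the function name `cauchy_score_statistic` and the sample values are purely illustrative and are not taken from the text.

```python
def cauchy_score_statistic(xs, mu0=0.0):
    """Score statistic (1/n) * S^2 / I(mu0) for the Cauchy location model,
    where S is the score at mu0 and I(mu) = 1/2, so the statistic is 2*S^2/n."""
    n = len(xs)
    score = sum(2.0 * (x - mu0) / (1.0 + (x - mu0) ** 2) for x in xs)
    return 2.0 * score ** 2 / n

CHI2_1_95 = 3.841  # tabulated 95% point of chi-square with 1 d.f.

# Hypothetical observations, used only to show the calling convention.
sample = [-1.2, 0.4, 2.5, 0.9, -0.3, 1.7, 3.1, 0.2]
q = cauchy_score_statistic(sample, mu0=0.0)
print("Q =", round(q, 3), "reject H0" if q > CHI2_1_95 else "do not reject H0")
```

No iteration is involved, in contrast to the LRT or Wald statistic, which first require the MLE of $\mu$ by the method of scoring.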


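Example 9.3.2. Consider a two parameter Gamma distribution with pdf
given by

$$f(x, \alpha, \lambda) = \frac{\alpha^{\lambda}}{\Gamma(\lambda)}\, e^{-\alpha x} x^{\lambda - 1}, \quad x > 0,\ \alpha > 0,\ \lambda > 0.$$

Suppose $H_0 : \alpha = 1, \lambda = 1$, i.e. under $H_0$ we have a standard exponential distribution.
Then we have

$$\frac{\partial \log L}{\partial \alpha} = -\sum x_i + \frac{n\lambda}{\alpha}, \qquad \left(\frac{\partial \log L}{\partial \alpha}\right)_{\alpha=1,\lambda=1} = -\sum x_i + n$$

$$\frac{\partial \log L}{\partial \lambda} = -n\psi(\lambda) + n \log \alpha + \sum \log x_i, \qquad \left(\frac{\partial \log L}{\partial \lambda}\right)_{\alpha=1,\lambda=1} = -n\psi(1) + \sum \log x_i$$

where $\psi(\lambda) = \frac{d}{d\lambda} \log \Gamma(\lambda)$. The information matrix is
$I_{\alpha\alpha} = \frac{\lambda}{\alpha^2}$, $I_{\alpha\lambda} = I_{\lambda\alpha} = -\frac{1}{\alpha}$ and $I_{\lambda\lambda} = \psi'(\lambda)$.

Hence the test statistic $Q(\theta_0)$ given by (9.3.2) now becomes

$$Q(\theta_0) = \left[\,n - \sum x_i,\ \ \sum \log x_i - n\psi(1)\right]
\begin{bmatrix} n & -n \\ -n & n\psi'(1) \end{bmatrix}^{-1}
\begin{bmatrix} n - \sum x_i \\ \sum \log x_i - n\psi(1) \end{bmatrix}$$

From Abramowitz and Stegun (1968), $\psi(1) = -\gamma$ (Euler's constant) and
$\psi'(1) = \pi^2/6$ and for observed data we can calculate $Q(\theta_0)$.

Here also the MLE $(\hat\alpha, \hat\lambda)$ has to be obtained by the iterative procedure given in
(7.6.5).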


We recommend that the reader carry out a small simulation experiment and
compare values of the test statistics $Q(\theta_0)$ and $U_m$ and compare the labour
involved in computing them.

In the above discussion we considered the situation where $H_0 : \theta = \theta_0$
is a simple hypothesis. More often than not we are interested in a null
hypothesis which specifies only a few components of $\theta$, say $H_0 : \theta_1 = \theta_{10},
\ldots, \theta_k = \theta_{k0}$, and where the remaining components of $\theta$ are unspecified and act
as nuisance parameters. We have seen examples of this type of problem in
Sections 9.1 and 9.2. To illustrate the situation consider testing $H_0 : \lambda = 1$
with $\alpha$ unspecified in Example 9.3.2 considered above. This is similar to
testing $H_0 : \theta = \theta_0$, $\sigma^2$ unspecified in the $N(\theta, \sigma^2)$ model for which we have
a test equivalent to the well known one-sample Student's t test.

We first consider the problem when we have a random sample of
size $n$ from a distribution with pdf belonging to the $m$-parameter Cramér
family $\{f(x, \theta^{(1)}, \theta^{(2)}), (\theta^{(1)}, \theta^{(2)}) \in \Omega_k \times \Omega_{m-k} \subset R_m\}$ and $H_0 : \theta^{(1)} = \theta_0^{(1)}$
with $\theta^{(2)}$ acting as a nuisance parameter. Under $H_0$, $L_0(x, \theta_0^{(1)}, \theta^{(2)})$ is the
likelihood of the sample for fixed $x$ and $\theta^{(2)} \in \Omega_{m-k}$. Let $\hat\theta_0^{(2)}$ be the MLE
of $\theta^{(2)}$ when $\theta^{(1)} = \theta_0^{(1)}$ and let $(\hat\theta^{(1)}, \hat\theta^{(2)})'$ be the MLE of $\theta = (\theta^{(1)}, \theta^{(2)})'$
under $H_0 \cup H_1$. Then the LRTS is given by

$$-2 \log \lambda = 2\{\log L(x, \hat\theta^{(1)}, \hat\theta^{(2)}) - \log L(x, \theta_0^{(1)}, \hat\theta_0^{(2)})\} \qquad (9.3.4)$$

It can be shown that the asymptotic distribution of $-2 \log \lambda(x)$ as given
in (9.3.4) is $\chi^2_k$. We will not go into a detailed proof of the result but
give a heuristic argument below. The interested reader may refer to Rao
(1973), and Wald (1943).

Consider $\theta_0 = (\theta_0^{(1)}, \theta_0^{(2)}) \in \Omega_{H_0}$. Then the RHS of (9.3.4) can be written as

$$-2 \log \lambda(x) = 2[\log L(x, \hat\theta^{(1)}, \hat\theta^{(2)}) - \log L(x, \theta_0^{(1)}, \theta_0^{(2)})]
- 2[\log L(x, \theta_0^{(1)}, \hat\theta_0^{(2)}) - \log L(x, \theta_0^{(1)}, \theta_0^{(2)})]$$

On expansion in Taylor series around $(\hat\theta^{(1)}, \hat\theta^{(2)})$ in the first bracket and
$(\theta_0^{(1)}, \hat\theta_0^{(2)})$ in the second bracket we have

$$-2 \log \lambda(x) = n(\hat\theta - \theta_0)' I(\theta_0)(\hat\theta - \theta_0) - n(\hat\theta_0^{(2)} - \theta_0^{(2)})' I_{22}(\theta_0)(\hat\theta_0^{(2)} - \theta_0^{(2)}) \qquad (9.3.5)$$

where the information matrix is partitioned as

$$I(\theta_0) = \begin{bmatrix} I_{11}(\theta_0) & I_{12}(\theta_0) \\ I_{21}(\theta_0) & I_{22}(\theta_0) \end{bmatrix}$$

with $I_{11}(\theta_0)$ of order $k \times k$ and $I_{22}(\theta_0)$ of order $(m - k) \times (m - k)$.

Under $(\theta_0^{(1)}, \theta_0^{(2)}) \in \Omega_{H_0}$, the asymptotic distribution of the first term on the RHS
of (9.3.5) is $\chi^2_m$ and of the second term is $\chi^2_{m-k}$. Thus $-2 \log \lambda(x)$
for large $n$ has the same asymptotic distribution as that of $\chi^2_m - \chi^2_{m-k}$. It is thus
tempting to guess that such a difference should behave like $\chi^2_k$ in view of
the additive property of independent chi-square variables, namely
$\chi^2_m = \chi^2_k + \chi^2_{m-k}$. If Cochran's Theorem (1934) on quadratic forms in normal
variables holds [see also Cramér (1946)] then we can conclude that
$\chi^2_m - \chi^2_{m-k}$ is indeed $\chi^2_k$. A delicate argument given by Wilks (1938) shows
that Cochran's theorem holds for the asymptotic distribution and indeed
one can show that $-2 \log \lambda(x) \xrightarrow{L} \chi^2_k$.

We can generalize this result further where the null hypothesis states that
$\theta = (\theta_1, \ldots, \theta_m)'$ satisfies $k$ functional relations $\psi_1(\theta) = c_1, \ldots, \psi_k(\theta) = c_k$
where $\psi_1, \ldots, \psi_k$ are such that there exist functions $\psi_{k+1}(\theta), \ldots, \psi_m(\theta)$ so
that

$$\frac{\partial(\psi_1, \ldots, \psi_m)}{\partial(\theta_1, \ldots, \theta_m)} \ne 0 \quad \text{for any } \theta \in \Omega,$$

i.e. $\psi(\theta) = (\psi_1, \ldots, \psi_m)'$ is a one to one transformation.

The question of obtaining a large sample test for $H_0 : \psi_i(\theta) = c_i$, where
$c_i$ are known constants, $i = 1, 2, \ldots, k$, can now be solved by using a
reparametrization $\theta \to \psi$ which is one-to-one and differentiable. Let
$\psi = (\psi_1, \ldots, \psi_m)' \in \Psi$ and let $\{L_1(x, \psi), \psi \in \Psi \subset R_m\}$ be the transformed
family of pdfs. Then $H_0$ specifies the first $k$ co-ordinates of $\psi$ to be $(c_1, \ldots, c_k)'$
and $\psi_{k+1}, \ldots, \psi_m$ are now nuisance parameters. We then have the LRT in
terms of MLEs of $\psi$ under $H_0$ and $H_0 \cup H_1$. Using invariance of the MLE
under a non-singular one-to-one transformation, by transforming back to $\theta$
from $\psi$, the LRTS $-2 \log \lambda(x)$ would be asymptotically $\chi^2_k$ under $H_0$.
We illustrate this with a problem for test of homogeneity for several
exponential distributions.


Example 9.3.3. Consider testing homogeneity of $m$ exponential populations
with samples of size $n_i$ each so that

$$L(x, \sigma_1, \ldots, \sigma_m) = \prod_{i=1}^{m} \left[\sigma_i^{-n_i} \exp\left\{-\sum_{j=1}^{n_i} x_{ij}/\sigma_i\right\}\right]$$

and $H_0 : \sigma_1 = \sigma_2 = \ldots = \sigma_m$ ($= \sigma$ unspecified).

Routine calculations show that under $H_0$, $\hat\sigma = \frac{1}{N} \sum_{i=1}^{m} \sum_{j=1}^{n_i} x_{ij} = \frac{T}{N}$ where
$N = \sum n_i$, and $\hat\sigma_i = \frac{1}{n_i} \sum_{j=1}^{n_i} x_{ij} = \frac{T_i}{n_i}$. Thus, the LRT is given by

$$-2 \log \lambda(x) = 2\left\{N \log \hat\sigma - \sum_{i=1}^{m} n_i \log \hat\sigma_i\right\}$$

which is asymptotically $\chi^2_{m-1}$. Since under $H_0$
we have $m - 1$ relations specified, the LRTS is $\chi^2_{m-1}$. Another equivalent
way to specify the degrees of freedom for $-2 \log \lambda(x)$ is to subtract the
number of parameters estimated under $H_0$ (1 in this case) from the number
of parameters estimated under $H_0 \cup H_1$ ($m$ in this case).
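The statistic of Example 9.3.3 depends on the data only through the sample sizes and sample means, so it is easy to compute. The sketch below assumes each sample is given as a list of positive observations; the function name `exp_homogeneity_lrt` and the data are hypothetical.

```python
import math

def exp_homogeneity_lrt(samples):
    """Return (-2 log lambda, d.f.) for testing equality of the means of
    several exponential populations, using 2{N log s - sum n_i log s_i}."""
    sizes = [len(s) for s in samples]
    total_n = sum(sizes)
    pooled_mean = sum(sum(s) for s in samples) / total_n      # MLE under H0
    means = [sum(s) / len(s) for s in samples]                # MLEs under H0 u H1
    stat = 2.0 * (total_n * math.log(pooled_mean)
                  - sum(n * math.log(m) for n, m in zip(sizes, means)))
    return stat, len(samples) - 1

# Hypothetical failure-time data for three populations.
data = [[0.8, 1.9, 0.4, 2.6], [1.1, 3.0, 2.2], [0.5, 0.7, 1.4, 0.9, 1.0]]
stat, df = exp_homogeneity_lrt(data)
print(f"-2 log lambda = {stat:.3f} on {df} d.f.")
```

The returned value is compared with the upper $\alpha$ point of $\chi^2_{m-1}$; when all sample means coincide the statistic is zero.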

Example 9.3.4. Consider a system with two components $(C_1, C_2)$ such that
to start with, the failure time distributions of $C_1$ and $C_2$ are both exponential
with failure rates $\alpha$ and $\beta$ respectively. If $C_1$ fails first the system continues
to work until $C_2$ fails but the failure time distribution of $C_2$ now becomes
exponential with failure rate $\beta'$. Similarly if $C_2$ fails first then the system
continues to work until $C_1$ fails but the failure time distribution of $C_1$ now
becomes exponential with failure rate $\alpha'$. Let $(X, Y)$ denote the failure times
of components $C_1$ and $C_2$; then the joint pdf of $(X, Y)$ was obtained by
Freund (1961) and is given by

$$\log f = I_A(x, y)[\log \alpha\beta' - \log \alpha'\beta] + \log \alpha'\beta - \alpha' x - (\alpha + \beta - \alpha')y
+ (\alpha' + \beta' - \alpha - \beta)(x - y) I_A(x, y) \qquad (9.3.6)$$

where $A = \{(x, y) \mid 0 < x < y\}$ and $I_A(x, y)$ is the indicator function of the
set $A$. One can show that the above pdf depending on $\theta = (\alpha, \beta, \alpha', \beta')'$
belongs to a four parameter exponential family with Fisher information
matrix

$$I(\theta) = \mathrm{diag}\left[\frac{1}{\alpha(\alpha + \beta)},\ \frac{1}{\beta(\alpha + \beta)},\ \frac{\beta}{(\alpha')^2(\alpha + \beta)},\ \frac{\alpha}{(\beta')^2(\alpha + \beta)}\right]$$

The minimal sufficient statistic is four dimensional and is given by
$\sum I_A(x, y) = n_1$ = number of observations with $x < y$, i.e. where $C_1$ failed
first, together with $\sum x$, $\sum y$, $\sum (x - y) I_A(x, y)$. The MLEs for a random
sample of size $n$ on $f(x, y, \alpha, \beta, \alpha', \beta')$ are given by

$$\hat\alpha = \frac{n_1}{\sum \min(x, y)}, \qquad \hat\beta = \frac{n - n_1}{\sum \min(x, y)},$$

$$\hat\alpha' = \frac{n - n_1}{\sum x - \sum \min(x, y)}, \qquad \hat\beta' = \frac{n_1}{\sum y - \sum \min(x, y)}.$$

We urge the reader to work out the details which are simple but interesting.
The MLEs are CAN for $(\alpha, \beta, \alpha', \beta')'$ with variance covariance matrix
$\frac{1}{n} I^{-1}(\theta)$ given by

$$\frac{1}{n}\,\mathrm{diag}\left[\alpha(\alpha + \beta),\ \beta(\alpha + \beta),\ \frac{(\alpha')^2(\alpha + \beta)}{\beta},\ \frac{(\beta')^2(\alpha + \beta)}{\alpha}\right]$$

Freund (1961) had obtained the MLE but did not obtain $I(\theta)$ or $\frac{1}{n} I^{-1}(\theta)$ and
did not note that the pdf in (9.3.6) is a four parameter exponential family.
These results were derived by Hanagal and Kale (1992), in which they
consider testing the hypothesis of symmetry $H_0 : \alpha = \beta$ and $\alpha' = \beta'$. Then the
pdf under $H_0$ is given by

$$f_0(x, y, \theta_1, \theta_2) = \theta_1 \theta_2 \exp\{-\theta_2 |x - y| - 2\theta_1 \min(x, y)\} \qquad (9.3.7)$$

with $\hat\theta_1 = \frac{n}{2 \sum \min(x, y)}$ and $\hat\theta_2 = \frac{n}{\sum |x - y|}$ and the asymptotic variance
covariance matrix of $(\hat\theta_1, \hat\theta_2)'$ as $\mathrm{diag}\left[\frac{\theta_1^2}{n}, \frac{\theta_2^2}{n}\right]$, where $\theta_1 = \alpha = \beta$ and
$\theta_2 = \alpha' = \beta'$.

Since explicit MLEs are available,

$$\lambda = L_0(x, y, \hat\theta_1, \hat\theta_2) / L(x, y, \hat\alpha, \hat\beta, \hat\alpha', \hat\beta')$$

can be computed.

The LRT will reject $H_0$ if $-2 \log \lambda > \chi^2_{2,1-\alpha}$. The degrees of freedom for
$-2 \log \lambda$ is two, as under $H_0 \cup H_1$ we estimate four parameters whereas
under $H_0$ we estimate two parameters.

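Hanagal and Kale (1992) suggest using the fact that

$$\begin{pmatrix} \hat\alpha - \hat\beta \\ \hat\alpha' - \hat\beta' \end{pmatrix} \sim AN^{(2)}\left(\begin{pmatrix} \alpha - \beta \\ \alpha' - \beta' \end{pmatrix},\ \frac{\alpha + \beta}{n}\,\mathrm{diag}\left[\alpha + \beta,\ \frac{\alpha'^2}{\beta} + \frac{\beta'^2}{\alpha}\right]\right)$$

and construct the test statistic

$$T_1 = \frac{n(\hat\alpha - \hat\beta)^2}{(\hat\alpha + \hat\beta)^2} + \frac{n(\hat\alpha' - \hat\beta')^2}{(\hat\alpha + \hat\beta)\left(\dfrac{\hat\alpha'^2}{\hat\beta} + \dfrac{\hat\beta'^2}{\hat\alpha}\right)}$$

and reject $H_0$ if $T_1 > \chi^2_{2,1-\alpha}$.

The test statistic $T_1$ is a good large sample approximation to $-2 \log \lambda$
and follows Wald's suggestion to use $I^{-1}(\hat\theta)$ where $\hat\theta$ is the MLE of $\theta$ under
$H_0 \cup H_1$.

Hanagal and Kale (1992) also give a test for $H_0 : \alpha = \alpha'$ and $\beta = \beta'$
which is equivalent to independence of $X$ and $Y$ in the Freund model.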

Suppose we are testing $H_0 : \theta_i = \theta_{i0}$, $i = 1, 2, \ldots, k$ on the basis of a
sample of size $n$ when the pdf belongs to the $m$ parameter Cramér family.
Then one can obtain a test for $H_0$ against $H_1 : \theta_i \ne \theta_{i0}$ for some $i$ based on
a CAN estimator $\hat\theta = (\hat\theta_1, \ldots, \hat\theta_m)$ with asymptotic variance covariance matrix
$\frac{1}{n}\Lambda$. Partition the matrix $\Lambda$ as

$$\Lambda = \begin{bmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{bmatrix}$$

then $\hat\theta^{(1)} = (\hat\theta_1, \ldots, \hat\theta_k) \sim AN^{(k)}\left(\theta^{(1)}, \frac{1}{n}\Lambda_{11}\right)$.

Following Wald (1943) let

$$W = n(\hat\theta^{(1)} - \theta_0^{(1)})' \Lambda_{11}^{-1}(\hat\theta)(\hat\theta^{(1)} - \theta_0^{(1)}) \qquad (9.3.8)$$

A large sample test based on $W$ rejects $H_0$ if observed $W > \chi^2_{k,1-\alpha}$. In
general if $\hat\theta$ is the MLE, the test would have better power performance than
any test based on other CAN estimators. It is important to note that in
constructing the quadratic form $W$ in (9.3.8) we use $\hat\theta$ which is CAN for the
parameter $\theta$ under $H_0 \cup H_1$. In this we are following Student (1908). To
illustrate this point consider testing $H_0 : \mu = \mu_0$ against $H_1 : \mu \ne \mu_0$ in the
$N(\mu, \sigma^2)$ model where $\sigma^2$ is a nuisance parameter. Then consider
$\hat\theta = (\bar{x}, S^2/n)$ with $\frac{1}{n}\Lambda = \mathrm{diag}\left[\frac{\sigma^2}{n}, \frac{2\sigma^4}{n}\right]$.

The Wald statistic given by (9.3.8)
is $W = n(\bar{x} - \mu_0)^2/\hat\sigma^2$. The nuisance parameter $\sigma^2$ is estimated
by $\hat\sigma^2 = S^2/n$ which is CAN for $\sigma^2$ under $H_0$ as well as $H_1$. Suppose we
use a CAN estimator of $\sigma^2$ under $H_0$ only, say $\hat\sigma_0^2 = \sum (x_i - \mu_0)^2/n$. Then the
modified test statistic is given by $W' = n(\bar{x} - \mu_0)^2/\hat\sigma_0^2$ which is also $\chi^2_1$
for large $n$ under $H_0$. However as $\sum (x_i - \mu_0)^2 = \sum (x_i - \bar{x})^2 + n(\bar{x} - \mu_0)^2$,
$W' \le W$ and there will be several samples for which $W' < \chi^2_{1,1-\alpha} < W$. This
shows that the power of $W$ would be larger than the power of $W'$. However
one cannot anticipate that this kind of situation will occur in every case.
In general we recommend that in constructing a test based on a CAN
estimator with variance covariance matrix $\Lambda$ which involves nuisance
parameters one should evaluate $\Lambda$ at $\hat\theta$, the MLE or any other CAN estimator
under $H_0 \cup H_1$. We illustrate this technique by way of
a few examples.
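The comparison between $W$ and $W'$ in the normal model can be made concrete with a few lines of code. This is a sketch only; the function name `wald_statistics` and the data are illustrative. It uses the identity $\sum (x_i - \mu_0)^2 = \sum (x_i - \bar{x})^2 + n(\bar{x} - \mu_0)^2$, which gives $W' = W/(1 + W/n)$ and hence $W' \le W$.

```python
def wald_statistics(xs, mu0):
    """Return (W, W') for testing mu = mu0 in the N(mu, sigma^2) model.
    W uses sigma^2 estimated about the sample mean (valid under H0 u H1);
    W' uses sigma^2 estimated about mu0 (consistent under H0 only)."""
    n = len(xs)
    xbar = sum(xs) / n
    sigma2_hat = sum((x - xbar) ** 2 for x in xs) / n     # S^2 / n
    sigma2_hat0 = sum((x - mu0) ** 2 for x in xs) / n     # about mu0
    w = n * (xbar - mu0) ** 2 / sigma2_hat
    w_prime = n * (xbar - mu0) ** 2 / sigma2_hat0
    return w, w_prime

# Hypothetical data: W exceeds W' whenever the sample mean differs from mu0.
w, w_prime = wald_statistics([1.0, 2.0, 3.0, 4.0, 5.0], mu0=2.0)
print(w, w_prime)
```

For samples where $W' < \chi^2_{1,1-\alpha} < W$ the two statistics lead to opposite decisions, which is the source of the power difference noted above.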

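Example 9.3.5. Consider testing equality of probability of occurrence of
an event $E$ in two independent Bernoulli series of trials. Thus the likelihood
of the data is

$$L(x, y, \theta_1, \theta_2) = \theta_1^{\sum x_i}(1 - \theta_1)^{n_1 - \sum x_i}\, \theta_2^{\sum y_i}(1 - \theta_2)^{n_2 - \sum y_i}$$

and we have a two parameter exponential family with the MLEs
$\hat\theta_1 = \frac{\sum x_i}{n_1}$, $\hat\theta_2 = \frac{\sum y_i}{n_2}$ which are CAN with variance covariance matrix
$\Lambda = \mathrm{diag}\left[\frac{\theta_1(1 - \theta_1)}{n_1}, \frac{\theta_2(1 - \theta_2)}{n_2}\right]$. Here $(\sum x_i, \sum y_i)$ denotes the number of
successes in the two samples respectively. Note that for CANness to hold
we must have $n_1 \to \infty$ as well as $n_2 \to \infty$. It immediately follows that

$$\hat\theta_1 - \hat\theta_2 \sim AN\left(\theta_1 - \theta_2,\ \frac{\theta_1(1 - \theta_1)}{n_1} + \frac{\theta_2(1 - \theta_2)}{n_2}\right)$$

Now for $H_0 : \theta_1 = \theta_2$ we reject $H_0$ if

$$T_0 = (\hat\theta_1 - \hat\theta_2)^2 \left[\frac{\hat\theta_1(1 - \hat\theta_1)}{n_1} + \frac{\hat\theta_2(1 - \hat\theta_2)}{n_2}\right]^{-1} > \xi^2_{1-\alpha/2}$$

if the alternatives are two sided. If $H_1$ is $\theta_1 > \theta_2$ then we reject $H_0$ when

$$T_1 = (\hat\theta_1 - \hat\theta_2) \left[\frac{\hat\theta_1(1 - \hat\theta_1)}{n_1} + \frac{\hat\theta_2(1 - \hat\theta_2)}{n_2}\right]^{-1/2} > \xi_{1-\alpha}$$

Note that we use the estimator of $\mathrm{Var}(\hat\theta_1 - \hat\theta_2)$ which is consistent
under $H_0 \cup H_1$. A consistent estimator of $\mathrm{Var}(\hat\theta_1 - \hat\theta_2)$ under $H_0$ is
$\hat\theta(1 - \hat\theta)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)$ where $\hat\theta = \frac{\sum x_i + \sum y_i}{n_1 + n_2}$ is the MLE of $\theta$ under $H_0$, and some
books on statistical methods recommend its use.

We have considered the variance stabilization transformation for CAN
estimators due to Bartlett (1947) in Exercise (3) of Section 6.2. One can
very effectively use such a transformation for testing hypotheses. For example
in the Bernoulli series of trials the MLE of $\theta$, $\hat\theta = \bar{x}$, has variance $\frac{\theta(1 - \theta)}{n}$.
It can be easily verified that the transformation $\psi(\theta) = \sin^{-1}\sqrt{\theta}$ is such that
$\psi(\bar{x}) \sim AN\left(\sin^{-1}\sqrt{\theta}, \frac{1}{4n}\right)$. Thus in the above example we can use the test
statistic and reject $H_0$ if

$$\left(\sin^{-1}\sqrt{\hat\theta_1} - \sin^{-1}\sqrt{\hat\theta_2}\right)^2 \left[\frac{1}{4n_1} + \frac{1}{4n_2}\right]^{-1} > \xi^2_{1-\alpha/2}$$

in case alternatives are two sided. For one sided alternatives $\theta_1 > \theta_2$ we
reject $H_0$ if

$$\left(\sin^{-1}\sqrt{\hat\theta_1} - \sin^{-1}\sqrt{\hat\theta_2}\right) \left[\frac{1}{4n_1} + \frac{1}{4n_2}\right]^{-1/2} > \xi_{1-\alpha}$$

Fisher-Yates (1963) have tabulated the inverse sine function but nowadays
most hand held calculators have this function built in.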
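The two large sample statistics of Example 9.3.5 can be computed side by side. In the sketch below the function name `two_proportion_statistics` and the counts are hypothetical. The arcsine version relies on the standard fact that $\sin^{-1}\sqrt{\bar{x}}$ has asymptotic variance $1/(4n)$. The first statistic is undefined when both sample proportions are 0 or 1, a case the sketch does not handle.

```python
import math
from statistics import NormalDist

def two_proportion_statistics(s1, n1, s2, n2):
    """Return (T0, T_arcsine) for testing theta1 = theta2 from s1 successes
    in n1 trials and s2 successes in n2 trials (two sided versions)."""
    t1, t2 = s1 / n1, s2 / n2
    # Variance estimate that is consistent under H0 u H1.
    var_hat = t1 * (1 - t1) / n1 + t2 * (1 - t2) / n2
    t0 = (t1 - t2) ** 2 / var_hat
    # Variance stabilized version: arcsine square root of each proportion.
    diff = math.asin(math.sqrt(t1)) - math.asin(math.sqrt(t2))
    t_arcsine = diff ** 2 / (1 / (4 * n1) + 1 / (4 * n2))
    return t0, t_arcsine

alpha = 0.05
critical = NormalDist().inv_cdf(1 - alpha / 2) ** 2   # squared normal quantile

t0, t_arcsine = two_proportion_statistics(50, 100, 30, 100)  # hypothetical counts
print(t0 > critical, t_arcsine > critical)
```

Both statistics are compared with the same critical value, the square of the upper $\alpha/2$ point of the standard normal distribution.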


Exercise 9.3. (1) Analogous to the comparison of proportions from two independent
samples on the binomial distribution consider two independent samples of size $n_1$ and $n_2$
on two exponential distributions with means $\sigma_1$ and $\sigma_2$, respectively. Obtain the LRT to test
$H_0 : \sigma_1 \ge 2\sigma_2$ against $H_1 : \sigma_1 < 2\sigma_2$. Here the MLEs $\hat\sigma_1 = \bar{x}_1$ and $\hat\sigma_2 = \bar{x}_2$ are CAN with
asymptotic variance covariance matrix $\mathrm{diag}\left(\frac{\sigma_1^2}{n_1}, \frac{\sigma_2^2}{n_2}\right)$. Construct a test for the above
problem based on $\bar{x}_1 - 2\bar{x}_2$ which is $AN\left(\sigma_1 - 2\sigma_2, \frac{\sigma_1^2}{n_1} + \frac{4\sigma_2^2}{n_2}\right)$. Observing that the
variance stabilization transformation is logarithmic, construct a large sample test based
on $[\log \hat\sigma_1 - \log(2\hat\sigma_2)]$ for the above hypotheses.

(2) Let $(x_i, y_i)$, $i = 1, 2, \ldots, n$ be a random sample of size $n$ from a bivariate normal
distribution with parameter $\theta = (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)'$. Obtain the MLE of $\theta$. Construct a
large sample test for testing $H_0 : \rho \ge \rho_0$ against $\rho < \rho_0$ based on the MLE $\hat\rho = r$, the
Pearson correlation coefficient of the sample with asymptotic variance $\frac{(1 - \rho^2)^2}{n}$. Show
that Fisher's transformation $z = \frac{1}{2} \log \frac{1 + r}{1 - r}$ is a variance stabilizing transformation
and give a large sample test based on this transformation.


9.4 Consistency of a Large Sample Test 

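Consider the problem of testing $H_0 : \theta \le \theta_0$ vs $H_1 : \theta > \theta_0$ when we have a
random sample of size $n$ from $N(\theta, \sigma_0^2)$ where $\sigma_0^2$ is known. Then the LRT,
which is also the UMP test for the problem, has test function $\varphi_1(x) = 1$
if $\bar{x} > \theta_0 + \frac{\sigma_0}{\sqrt{n}} \xi_{1-\alpha}$ and zero elsewhere. The power function of this test
is given by

$$\beta_{\varphi_1}(\theta) = 1 - \Phi\left[\frac{\sqrt{n}(\theta_0 - \theta)}{\sigma_0} + \xi_{1-\alpha}\right] \qquad (9.4.1)$$

For any $\theta \in \Omega_{H_1}$, i.e. $\theta > \theta_0$, $\lim_{n \to \infty} \beta_{\varphi_1}(\theta) = 1$, i.e. the power of test $\varphi_1$
approaches one as $n \to \infty$ for any alternative hypothesis. This property is
known as the consistency of the test and formally we define a test $\varphi \in D_\alpha$
to be a consistent test if $\beta_\varphi(\theta) \to 1$ as $n \to \infty$ for all $\theta \in \Omega_{H_1}$. Note that here
$\Omega_{H_1}$ could be two sided and further $\theta$ could be vector valued. For example
in the above problem suppose we are testing $H_0 : \theta = \theta_0$ vs $H_1 : \theta \ne \theta_0$; then the
LRT of size $\alpha$, which is also known to be UMP Unbiased, rejects
$H_0$ if $\frac{n(\bar{x} - \theta_0)^2}{\sigma_0^2} > \chi^2_{1,1-\alpha} = \xi^2_{1-\alpha/2}$. As the reader can verify the power
function of this test is given by

$$\beta_2(\theta) = 1 - \Phi\left[\frac{\sqrt{n}(\theta_0 - \theta)}{\sigma_0} + \xi_{1-\alpha/2}\right] + \Phi\left[\frac{\sqrt{n}(\theta_0 - \theta)}{\sigma_0} + \xi_{\alpha/2}\right]$$

For any $\theta \ne \theta_0$, $\lim_{n \to \infty} \beta_2(\theta) = 1$ and the LRT is consistent.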
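Consistency can be seen numerically by evaluating the power functions for increasing $n$. The following sketch codes (9.4.1) and its two sided counterpart for the normal mean with known variance; the function names and the chosen alternative are illustrative assumptions, not part of the text.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def power_one_sided(theta, theta0, sigma0, n, alpha=0.05):
    """Power of the UMP test of theta <= theta0 vs theta > theta0, as in (9.4.1)."""
    shift = n ** 0.5 * (theta0 - theta) / sigma0
    return 1.0 - _STD_NORMAL.cdf(shift + _STD_NORMAL.inv_cdf(1 - alpha))

def power_two_sided(theta, theta0, sigma0, n, alpha=0.05):
    """Power of the two sided test that rejects when n(xbar - theta0)^2 / sigma0^2
    exceeds the squared upper alpha/2 normal point."""
    shift = n ** 0.5 * (theta0 - theta) / sigma0
    cut = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 1.0 - _STD_NORMAL.cdf(shift + cut) + _STD_NORMAL.cdf(shift - cut)

# Power at the (hypothetical) alternative theta = theta0 + 0.2 * sigma0.
sizes = (10, 100, 1000)
powers = [power_one_sided(0.2, 0.0, 1.0, n) for n in sizes]
for n, p in zip(sizes, powers):
    print(n, round(p, 4))
```

At $\theta = \theta_0$ both functions return the level $\alpha$, and at any fixed alternative the values increase to one as $n$ grows.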

If $\sigma^2$ is unknown then for the one sided problem $H_0 : \theta \le \theta_0$ against
$H_1 : \theta > \theta_0$ the LRT rejects $H_0$ if $\sqrt{n}(\bar{x} - \theta_0)/(S^2/(n - 1))^{1/2} > t_{n-1,1-\alpha}$. The
power function of this test is given by

$$\beta_3(\theta, \sigma^2) = P\left[\sqrt{n}(\bar{x} - \theta_0)/(S^2/(n - 1))^{1/2} > t_{n-1,1-\alpha} \mid (\theta, \sigma^2)\right] \qquad (9.4.2)$$

Under $H_1$ the test statistic has a non-central t-distribution with $(n - 1)$ d.f.
and non-centrality parameter $\delta^2 = (\theta - \theta_0)^2/\sigma^2$ and the power function can be
evaluated at any point $(\theta, \sigma^2) \in \Omega_{H_1}$. For details we refer to Kendall
and Stuart Vol. 2 (1967) and other references contained therein. To obtain
$\lim_{n \to \infty} \beta_3(\theta, \sigma^2)$, the RHS of (9.4.2) can be written as

$$P\left[\frac{\bar{x} - \theta_0}{(S^2/(n - 1))^{1/2}} - \frac{t_{n-1,1-\alpha}}{\sqrt{n}} > 0 \,\Big|\, \theta, \sigma^2\right]$$

Observe that as $n \to \infty$, $\bar{x} \xrightarrow{P} \theta$ and $(S^2/(n - 1))^{1/2} \xrightarrow{P} \sigma$. Further
$t_{n-1,1-\alpha} \to \xi_{1-\alpha}$ and $\frac{\xi_{1-\alpha}}{\sqrt{n}} \to 0$; therefore the r.v. in the bracket converges in
probability to $(\theta - \theta_0)/\sigma$ and we have

$$\lim_{n \to \infty} \beta_3(\theta, \sigma^2) = \begin{cases} 0 & \text{if } \theta < \theta_0 \\ \alpha & \text{if } \theta = \theta_0 \\ 1 & \text{if } \theta > \theta_0. \end{cases}$$

If we have two sided alternatives $\theta \ne \theta_0$ then the LRT rejects $H_0$ if
$\frac{n(\bar{x} - \theta_0)^2}{S^2/(n - 1)} > t^2_{n-1,1-\alpha/2}$ and using a similar argument as above we can show
that for any $\theta \ne \theta_0$, $\lim_{n \to \infty} \beta_4(\theta, \sigma^2) = 1$.


In fact the above argument can be generalized to show that the LRT in
general is consistent. Consider testing $H_0 : \theta_i = \theta_{i0}$, $i = 1, 2, \ldots, k$ with
$\theta_{k+1}, \ldots, \theta_m$ as nuisance parameters. As seen in the previous section, the
LRT rejects $H_0$ if $-2 \log \lambda > \chi^2_{k,1-\alpha}$ and Wald's test as an approximation
to the LRT rejects $H_0$ if

$$W = n(\hat\theta^{(1)} - \theta_0^{(1)})' \Lambda_{11}^{-1}(\hat\theta)(\hat\theta^{(1)} - \theta_0^{(1)}) > \chi^2_{k,1-\alpha}.$$

The power function of the Wald test is given by

$$\beta_W(\theta) = P_\theta[W > \chi^2_{k,1-\alpha}]
= P_\theta\left[(\hat\theta^{(1)} - \theta_0^{(1)})' \Lambda_{11}^{-1}(\hat\theta)(\hat\theta^{(1)} - \theta_0^{(1)}) > \chi^2_{k,1-\alpha}/n\right].$$

Now

$$(\hat\theta^{(1)} - \theta_0^{(1)})' \Lambda_{11}^{-1}(\hat\theta)(\hat\theta^{(1)} - \theta_0^{(1)}) \xrightarrow{P} (\theta^{(1)} - \theta_0^{(1)})' \Lambda_{11}^{-1}(\theta)(\theta^{(1)} - \theta_0^{(1)}) = \delta^2.$$

As $\delta^2 > 0$ and as $\frac{1}{n}\chi^2_{k,1-\alpha} \to 0$, for any $\theta^{(1)} \ne \theta_0^{(1)}$ we have $\lim_{n \to \infty} \beta_W(\theta) = 1$
for any $\theta \in \Omega_{H_1}$ and Wald's test is consistent which implies that the LRT is
also consistent. We note that the asymptotic distribution of Wald's test
statistic is a non-central $\chi^2$ with $k$ degrees of freedom and non-centrality
parameter $\delta^2$ as defined above. For specific values of $n$, sufficiently large,
the value of $\beta(\theta)$ at a given point $\theta \in \Omega_{H_1}$ can be obtained. For more
details we again refer to Kendall and Stuart, Vol. 2 (1967) and references
contained therein. We can also argue in a similar way to claim that the tests
based on CAN estimators discussed at the end of Section 9.3 are consistent.
Detailed proofs are not attempted here as this is a first course in parametric
inference. On the other hand we discuss the implications of consistency
in practice as well as in theory, assuming that arbitrarily large
samples are available.

Example 9.4.1. Consider the problem discussed at the beginning of this
section. Suppose that we want the test to have power $\beta$, or equivalently
type II error $(1 - \beta)$, at a given $\theta > \theta_0$; then using (9.4.1), we have the equation defining
the sample size $n$ given by

$$1 - \Phi\left[\frac{\sqrt{n}(\theta_0 - \theta)}{\sigma_0} + \xi_{1-\alpha}\right] = \beta$$

which on simplification gives

$$n_0 = \frac{\sigma_0^2(\xi_{1-\beta} - \xi_{1-\alpha})^2}{(\theta_0 - \theta)^2}$$

This determination of $n_0$ is very similar to the determination of the minimum
sample size required for a consistent estimator $T$ to achieve a degree of
accuracy specified by $(\varepsilon, \delta)$ which has been illustrated in Examples 5.1.1
and 5.4.1.

Let $\theta = \theta_0 + k\sigma_0$ where $k$ measures the departure of $\theta$ from $\theta_0$ in units
of the standard deviation $\sigma_0$. Then

$$n_0 = \left[\frac{(\xi_{1-\beta} - \xi_{1-\alpha})^2}{k^2}\right] + 1$$

where $[u]$ denotes the integer part of $u$. Observe that for
fixed $(\alpha, \beta)$, $n_0$ decreases as $k$ increases, indicating that larger departures
from $\theta_0$ are easier to detect with smaller sample sizes, but we need large
samples to detect small departures from $\theta_0$. The following table gives values of
$n_0$ for different combinations $\beta$ = .90, .95, .99 and $k$ = .1, .3, .5, 1 when
$\alpha$ = .05.


Values of $n_0$

  k \ β      .90      .95      .99
   .1        857     1083     1578
   .3         96      121      176
   .5         35       44       64
  1.0          9       11       16

In a more realistic situation where $\sigma^2$ is unknown, the problem of
determining the minimum sample size $n_0$ for a level $\alpha$ = .05 test to have power
$\beta$ = .90, .95, .99 at specific values of $\frac{\theta - \theta_0}{\sigma}$ is much more difficult. Again
we refer to Section 24.35 in Kendall and Stuart, Vol. 2 (1967) and references
contained therein to obtain the power function of the Student's t test at
specific values of $(\theta - \theta_0)/\sigma = k$ for given $n$. One can obtain values of $n_0$
by assuming $n_0$ to be large enough so that the distribution of the t statistic is $N(0, 1)$
under $H_0$ and under the alternatives it is $N(\sqrt{n}\,k, 1)$. This approximation
will lead to the table of values of $n_0$ given above.

Next consider an example in which the mean and the variance are
functions of the same parameter such as in the case of the exponential
distribution with mean $\theta$.

Example 9.4.2. Let $(X_1, \ldots, X_n)$ be i.i.d. exponential with mean $\theta$ and let
$H_0 : \theta \ge \theta_0$ and $H_A : \theta < \theta_0$. The LRT of level $\alpha$, which is also UMP, rejects
$H_0$ for small values of $\sum X_i$ and has power function given by

$$\beta(\theta) = F_n\left[\frac{\theta_0\, \eta_{n,\alpha}}{\theta}\right]$$

where $\eta_{n,\alpha}$ is the $100\alpha\%$ point of a $G(n, 1)$ r.v. and
$F_n(u)$ is its d.f. Suppose we want to determine $n_0$ such that
$F_n\left[\frac{\theta_0\, \eta_{n,\alpha}}{\theta}\right] = \beta$ at $\frac{\theta_0}{\theta} = k$. Here as $\theta < \theta_0$, $k > 1$ and $n$ is determined by
$k\eta_{n,\alpha} = F_n^{-1}(\beta) = \eta_{n,\beta}$ and tables of incomplete Gamma functions can be
used to determine $n_0$ by trial and error. If we assume that $n_0$
is sufficiently large so that $\hat\theta = \bar{x} \sim AN(\theta, \theta^2/n)$ then we reject $H_0$ if
$\bar{x} < \theta_0 + \frac{\theta_0}{\sqrt{n}} \xi_\alpha$ with power $\beta(\theta) = \Phi[\sqrt{n}(k - 1) + k\xi_\alpha]$. Hence $n_0$ is determined by
the equation $\sqrt{n}(k - 1) + k\xi_\alpha = \xi_\beta$ or

$$n_0 = \left[\left(\frac{\xi_\beta - k\xi_\alpha}{k - 1}\right)^2\right] + 1.$$

The following table gives the values of $n_0$ for $\alpha$ = .05, $\beta$ = .90, .95, .99 and
$k$ = 1.1, 1.2, 1.3, 1.5, 2.0.

Values of $n_0$

  k \ β      .90      .95      .99
  1.1        955     1194     1711
  1.2        266      328      428
  1.5         57       67       92
  2.0         21       24       32

The problem of determining the minimum sample size with guaranteed
power at a specific alternative is a difficult one as it involves equations
or inequalities (mostly nonlinear) which are difficult to solve. However this
is an important problem and we refer the interested reader to Desu and Raghavarao
(1990) and references contained therein.

We point out that for $\alpha$ = .05 if we require power = .99 then it implies
that the type I error = .05 and the type II error = .01. This is contradictory to our
initial assumption that type I error is more serious than type II error.
In fact for any consistent test this situation will occur for any $\alpha \in (0, 1)$
since as $n \to \infty$, power $\to 1$ and therefore type II error $\to 0$. The above
paradoxical situation arises because of the fact that in bounding the type I error
by $\alpha$ we have not taken into account the sample size. An alternative then
is to select the level $\alpha_n$ depending on the sample size available. Consider
the following example to illustrate this.

Example 9.4.3. We consider testing $H_0 : \theta = \theta_0$ against $H_1 : \theta = \theta_1 > \theta_0$
in the $N(\theta, \sigma_0^2)$ model with $\sigma_0^2$ known and where $n$ can be chosen sufficiently
large. By (9.4.1)

$$\beta_n(\theta_1) = 1 - \Phi\left[\frac{\sqrt{n}(\theta_0 - \theta_1)}{\sigma_0} + \xi_{1-\alpha_n}\right] \qquad (9.4.3)$$

We now require that $1 - \beta_n(\theta_1)$ = Type II error $\ge \alpha_n$ = Type I error. Thus
for given large $n$ select $\alpha_n$ such that $1 - \beta_n(\theta_1) = \alpha_n = \beta_n(\theta_0)$, i.e.

$$\Phi\left[\frac{\sqrt{n}(\theta_0 - \theta_1)}{\sigma_0} + \xi_{1-\alpha_n}\right] = \alpha_n = \beta_n(\theta_0)$$

or

$$\frac{\sqrt{n}(\theta_0 - \theta_1)}{\sigma_0} + \xi_{1-\alpha_n} = \xi_{\alpha_n} = -\xi_{1-\alpha_n}$$

Now, as $\theta_1 > \theta_0$, $\xi_{1-\alpha_n}$ is therefore given by

$$\xi_{1-\alpha_n} = \frac{\sqrt{n}}{2}\left(\frac{\theta_1 - \theta_0}{\sigma_0}\right)$$

For example if $n = 400$ and $\frac{\theta_1 - \theta_0}{\sigma_0} = 1$ then $\xi_{1-\alpha_n} = 10$ or the level
$\alpha_n = 1 - \Phi(10)$. Using the approximation $1 - \Phi(x) \approx \frac{1}{x\sqrt{2\pi}} e^{-x^2/2}$ for large
$x$ or using tables of Abramowitz and Stegun (1964), we obtain $1 - \Phi(10) =
10^{-23.119}$. For $n = 100$, similar calculations lead to $\alpha_n = 1 - \Phi(5) = 10^{-6.543}$. We
refer to Kunte and Gore (1992) for more discussion on this point. They also
point out that similar problems occur in tests of goodness of fit. In fact it has
been observed that very large samples would reject any precise null hypothesis
$H_0$ at the conventional levels $\alpha$ = .01, .05, .10.
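The levels quoted in Example 9.4.3 can be checked directly, since the balanced level is the upper tail of the standard normal distribution beyond $\frac{\sqrt{n}}{2} \cdot \frac{\theta_1 - \theta_0}{\sigma_0}$. The sketch below uses the complementary error function, which stays accurate far into the tail; the function name `balanced_level` is illustrative.

```python
import math

def balanced_level(n, k):
    """Level alpha_n at which type I error equals type II error when the
    alternative is theta1 = theta0 + k*sigma0 and the sample size is n."""
    cutoff = math.sqrt(n) * k / 2.0
    # 1 - Phi(cutoff), written with erfc to avoid cancellation for large cutoff.
    return 0.5 * math.erfc(cutoff / math.sqrt(2.0))

for n in (100, 400):
    level = balanced_level(n, 1.0)
    print(n, f"alpha_n = 10^{math.log10(level):.3f}")
```

Such levels are many orders of magnitude below the conventional .01, .05 and .10, which illustrates how strongly a balanced choice of $\alpha_n$ depends on $n$.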

When $H_1$ is composite, $\theta > \theta_0$, $\alpha_n$ would depend both on $n$ and $\theta \in \Omega_{H_1}$.
By fixing a power at a particular alternative $\theta_1$ such that $1 - \beta_n(\theta_1) \ge \alpha_n$
we can determine $\alpha_n$ as a function of $n$ and $\theta_1$. However for any $\theta \in (\theta_0, \theta_1)$,
$1 - \beta_n(\theta) > \alpha_n$ and we have the paradoxical situation that the
type II error at $\theta \notin (\theta_0, \theta_1)$ falls below the type I error. This
paradox also appears even when $n$ is fixed but $\theta \in \Omega_{H_1}$ is sufficiently large
as $1 - \beta(\theta) \to 0$ as $\theta \to \infty$. Thus the choice of $\alpha$, the upper limit to the
type I error, has to depend on the sample size $n$ as well as the alternatives.

The conventional levels $\alpha = .01$, $\alpha = .05$ or $\alpha = .10$ were introduced by Fisher in developing tests of significance which did not refer to specific alternatives but depended on a test statistic $T$ whose distribution under $H_0$ is completely known. If the observed value of $T$ belongs to the tails of its distribution then we reject $H_0$. The tails were defined by levels $\alpha = .01$, $\alpha = .05$ or $\alpha = .10$. Whether we should use both the tails or only the right tail or only the left tail of the distribution depends on the context and indirectly involves $H_1$. Fisher insisted that in scientific investigations alternatives to $H_0$ are difficult to specify and power considerations or type II error are not relevant. In cases where alternatives are specified as in SQC problems, Fisher agreed that the type II error can be considered although he would prefer to view these as problems in estimation.

The Neyman-Pearson theory does not provide any guidelines as to how to choose $\alpha$. Further, once $\alpha$ is fixed it recommends randomization on the boundary $E_2 = \{x \mid L(x, \theta_1) = k_\alpha L(x, \theta_0)\}$. Most scientists and engineers do not like the idea of making a decision based on randomization when the data have been collected after careful experimentation. The real meaning of the necessity of randomization on $E_2$ is the fact that some additional data are required before the decision can be made.

Such modifications to fixed sample size procedures have led to two
stage or in general multistage and sequential estimation and testing procedures.
The path-breaking work of Wald (1947, 1950) deserves a special mention.
Wald generalized the Neyman-Pearson approach to Decision Theoretic 
Statistics and Sequential Sampling. We will not pursue this line of thought, 
but instead refer to some well known texts such as Ferguson (1967), Kendall 
and Stuart Vol. II (1967), Rohatgi (1976) and Zacks (1971) among others. 

9.5 Convex Combination of Two types of Errors 

Consider the simplest problem of testing $H_0 : \theta = \theta_0$ against $H_1 : \theta = \theta_1$ on the basis of the sample $x = (x_1, \ldots, x_n)$. As seen before let $E[\varphi(x) \mid \theta] = \beta_\varphi(\theta)$ be the power function of a test $\varphi$. Then the two types of errors are given by $\beta_\varphi(\theta_0)$ and $1 - \beta_\varphi(\theta_1)$ respectively. Let $0 < w < 1$ denote the relative weight of the type I error so that the average error is given by

$$R(\varphi, w) = w\,\beta_\varphi(\theta_0) + (1 - w)(1 - \beta_\varphi(\theta_1))$$

$$= (1 - w) + \int [w L(x, \theta_0) - (1 - w) L(x, \theta_1)]\,\varphi(x)\,dx$$

where the integration is over $S_0 \cup S_1$, the union of the supports of $L(x, \theta_0)$ and $L(x, \theta_1)$. For given weights $w$ and $(1 - w)$ the test $\varphi^*$ would be optimum if $R(\varphi^*, w) \leq R(\varphi, w)$ for any test function $\varphi(x)$. To minimize $R(\varphi, w)$ for variations in $\varphi$ observe that as $0 \leq \varphi(x) \leq 1$, if $[w L(x, \theta_0) - (1 - w) L(x, \theta_1)] < 0$ we take $\varphi^*(x) = 1$, and $\varphi^*(x) = 0$ if $[w L(x, \theta_0) - (1 - w) L(x, \theta_1)] > 0$. On the set $E_2 = \{x \mid w L(x, \theta_0) = (1 - w) L(x, \theta_1)\}$, $\varphi^*(x)$ can be defined arbitrarily since the integrand over $E_2$ is zero and does not contribute to $R(\varphi, w)$. Thus the best test which minimizes average error for fixed weights $w$ is given by

$$\varphi_w^*(x) = \begin{cases} 1 & \text{if } L(x, \theta_1) > \frac{w}{1-w} L(x, \theta_0) \\ \gamma(x) & \text{if } L(x, \theta_1) = \frac{w}{1-w} L(x, \theta_0) \\ 0 & \text{if } L(x, \theta_1) < \frac{w}{1-w} L(x, \theta_0) \end{cases} \qquad (9.5.1)$$


The structure of $\varphi^*(x)$ is similar to that of the MP test given by the N-P lemma with $k = \frac{w}{1-w}$ or $w = \frac{k}{1+k}$.

Now in the N-P lemma, to obtain the MP test we take $\gamma(x) = \gamma$ on the set $E_2$, such that $E[\varphi_0(x) \mid \theta_0] = \alpha$, and without loss of generality we can take $\gamma(x) = \gamma$ on the set $E_2$ for the test $\varphi^*(x)$. In case $P(E_2) = 0$ under both $H_0$ and $H_1$, we have a one to one correspondence between the MP test $\varphi_0(x)$ of level $\alpha$ and $\varphi^*(x)$ which minimizes $R(\varphi, w)$ for given $w$. Thus in choosing $\alpha$ one implicitly fixes $w$ and in choosing $w$ one implicitly fixes $\alpha$, at least in the case where $P_{\theta_0}(E_2) = 0 = P_{\theta_1}(E_2)$, which is the case when we have a continuous $X$ with pdf $\{L(x, \theta), \theta \in \Omega\}$ with the support of $L$ not depending on $\theta$. In the further discussion we will assume this to be the case and therefore take

$$\varphi_w^*(x) = \begin{cases} 1 & \text{if } L(x, \theta_1) > \frac{w}{1-w} L(x, \theta_0) \\ 0 & \text{otherwise.} \end{cases} \qquad (9.5.2)$$

Let the $\alpha$ corresponding to $k = \frac{w}{1-w}$ be denoted by $\alpha_w$; then $\varphi_0(x)$, the MP level $\alpha_w$ test, is in one to one correspondence with $\varphi^*(x)$.

One can interpret the weights $w$ and $(1 - w)$ also as probabilities measuring uncertainty about the values of $\theta$. Thus we can regard $w = P(\theta = \theta_0)$ and $(1 - w) = P(\theta = \theta_1)$ and then $R(\varphi, w)$ is the average error of test procedure $\varphi$ when the distribution over $\{\theta_0, \theta_1\}$ is given by weights $\{w, 1 - w\}$.

The basic idea of expressing uncertainty about the parameter value $\theta$ by a probability distribution over possible values of $\theta$ is due to Bayes (1763). Suppose that we have $k$ possible mutually exclusive and exhaustive causes or hypotheses $H_1, H_2, \ldots, H_k$ indexed by $\theta_1, \ldots, \theta_k$ respectively and the probability of obtaining data $D$ when the hypothesis $H_j$, i.e. say $\theta = \theta_j$, holds is given by $P(D \mid H_j) = L(D \mid \theta_j)$. Further if the prior probability distribution over $(\theta_1, \ldots, \theta_k)$ is given by $(w_1, \ldots, w_k)$ then a straightforward calculation shows that

$$P(H_j \mid D) = \frac{L(D \mid \theta_j)\, w_j}{\sum_{r=1}^{k} L(D \mid \theta_r)\, w_r} \qquad (9.5.3)$$

which is the well known Bayes Theorem first proved by Laplace (1774). When $k = 2$ with possible values $\theta_0, \theta_1$ with prior probabilities $w$ and $1 - w$ respectively, we have

$$P(\theta_0 \mid x) = \frac{L(x, \theta_0)\, w}{L(x, \theta_0)\, w + L(x, \theta_1)(1 - w)}, \qquad P(\theta_1 \mid x) = \frac{L(x, \theta_1)(1 - w)}{L(x, \theta_0)\, w + L(x, \theta_1)(1 - w)} \qquad (9.5.4)$$

$\{P(\theta_0 \mid x), P(\theta_1 \mid x)\}$ is called the posterior distribution over $\{\theta_0, \theta_1\}$ when the data $x$ is observed. According to Bayes the statistical inference consists of the change from the prior distribution $\{w, 1 - w\}$ over $\{\theta_0, \theta_1\}$ to the posterior distribution $\{P(\theta_0 \mid x), P(\theta_1 \mid x)\}$ when the data $x$ is observed. We define the Bayes test for $H_0$ against $H_1$ as

$$\varphi_B(x) = \begin{cases} 1 & \text{if } P(\theta_1 \mid x) > P(\theta_0 \mid x) \\ 0 & \text{otherwise.} \end{cases}$$

However $P(\theta_1 \mid x) > P(\theta_0 \mid x)$ is equivalent to $L(x, \theta_1) > \frac{w}{1-w} L(x, \theta_0)$ and $\varphi_B(x)$ is equivalent to $\varphi^*(x)$ which minimizes $R(\varphi, w)$ and corresponds to the MP level $\alpha_w$ test. Thus the Neyman-Pearson approach of controlling type I error at a level $\alpha$ and then minimizing type II error is equivalent to minimizing the average error $R(\varphi, w)$ with weights $w$ and $(1 - w)$, which in turn is equivalent to the Bayes test for $H_0$ against $H_1$ when the prior distribution over $\{\theta_0, \theta_1\}$ is given by $\{w, 1 - w\}$. Of course if $P(E_2) \neq 0$ under either $\theta_0$ or $\theta_1$ this will create a few problems of detail in establishing such equivalence and we will not pursue this aspect further.
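The equivalence between the Bayes test and the weighted likelihood ratio test (9.5.2) can be illustrated with a small numerical sketch. The $N(\theta, 1)$ model, the values $\theta_0 = 0$, $\theta_1 = 1$ and all function names below are assumptions chosen for illustration; they are not part of the text.

```python
import math

def normal_likelihood(xs, theta):
    """Likelihood L(x, theta) of an i.i.d. N(theta, 1) sample (assumed model)."""
    kernel = math.exp(-0.5 * sum((x - theta) ** 2 for x in xs))
    return kernel / (2 * math.pi) ** (len(xs) / 2)

def posterior(xs, theta0, theta1, w):
    """Posterior probabilities (P(theta0 | x), P(theta1 | x)) as in (9.5.4)."""
    l0 = normal_likelihood(xs, theta0) * w
    l1 = normal_likelihood(xs, theta1) * (1 - w)
    return l0 / (l0 + l1), l1 / (l0 + l1)

def bayes_test(xs, theta0, theta1, w):
    """Bayes test: reject H0 (return 1) when P(theta1 | x) > P(theta0 | x)."""
    p0, p1 = posterior(xs, theta0, theta1, w)
    return 1 if p1 > p0 else 0

def weighted_lr_test(xs, theta0, theta1, w):
    """Test (9.5.2): reject H0 when L(x, theta1) > w/(1-w) * L(x, theta0)."""
    k = w / (1 - w)
    l0 = normal_likelihood(xs, theta0)
    l1 = normal_likelihood(xs, theta1)
    return 1 if l1 > k * l0 else 0

sample = [0.6, 0.4, 0.7]
for w in (0.2, 0.5, 0.9):
    print(w, bayes_test(sample, 0.0, 1.0, w), weighted_lr_test(sample, 0.0, 1.0, w))
```

Away from the boundary set $E_2$, the two functions return the same decision for every prior weight $w$, which is the content of the equivalence stated above.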

The above approach of assuming a prior probability distribution on the parameter space of interest is originally due to Bayes (1763). Bayes' solution to the problem of estimation of $\theta$ in the model $\{L(x, \theta), \theta \in \Omega\}$ was to quantify uncertainty about $\theta$ by a probability distribution called the prior distribution of $\theta$. Then treating $\theta$ as an unobservable r.v., we define the joint distribution of $(x, \theta)$ given by $L(x, \theta)\, p(\theta)$. Using standard arguments of conditional probability, Bayes suggested obtaining the posterior distribution of $\theta$ given $x$, defined by

$$p(\theta \mid x) = \frac{L(x, \theta)\, p(\theta)}{\int_\Omega L(x, \theta)\, p(\theta)\, d\theta} \qquad (9.5.5)$$

in case $\theta$ is continuous. If $\theta$ is discrete then the posterior distribution of $\theta$ for the data $x$ is given by

$$p(\theta_j \mid x) = \frac{L(x, \theta_j)\, p(\theta_j)}{\sum_r L(x, \theta_r)\, p(\theta_r)} \qquad (9.5.6)$$





The above method was proposed by Bayes to solve the following problem which in his own words was stated as follows.

"Given the number of times in which an unknown event has happened and failed; Required the chance that the probability of its happening in a single trial lies somewhere between any two degrees of probability that can be named".

Note that an 'unknown event' means a known event $E$ whose occurrence in a single trial has unknown probability $P(E) \in (0, 1)$. Observe that in the Required part, Bayes uses a different word 'chance' than the probability in Chance $[a < P(E) < b]$, where $0 < a < b < 1$, though later Bayes states that by chance he means the same as probability. Assuming a uniform prior distribution for $\theta = P(E)$ over $(0, 1)$, Bayes shows that

$$\text{Chance}\,[a < P(E) < b] = \int_a^b p^m (1 - p)^n\, dp \Big/ \int_0^1 p^m (1 - p)^n\, dp \qquad (9.5.7)$$

when $E$ has occurred $m$ times and failed to occur $n$ times in a Bernoulli series of $(m + n)$ trials.
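Formula (9.5.7) is a ratio of two integrals and can be evaluated numerically. The sketch below uses a composite Simpson rule, which is a choice made here for self-containment rather than anything prescribed by the text; the names `simpson` and `bayes_chance` are hypothetical.

```python
def simpson(f, lo, hi, steps=2000):
    """Composite Simpson rule on [lo, hi]; steps must be even."""
    h = (hi - lo) / steps
    total = f(lo) + f(hi)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0

def bayes_chance(a, b, m, n):
    """Chance[a < P(E) < b] from (9.5.7): m occurrences, n failures, uniform prior."""
    def kernel(p):
        return p ** m * (1 - p) ** n
    return simpson(kernel, a, b) / simpson(kernel, 0.0, 1.0)

# With no data (m = n = 0) the chance is just the prior length b - a.
print(bayes_chance(0.2, 0.7, 0, 0))
# After 7 occurrences and 3 failures the mass concentrates around 0.7.
print(bayes_chance(0.5, 0.9, 7, 3))
```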

Laplace [(1774), (1812), (1820)] took the lead from Bayes and proved Bayes theorem and formulae (9.5.5) and (9.5.6). Given the status of Laplace as "the second Newton" in the European mathematical world, the Bayes-Laplace method soon established itself, won many great followers and came to be known as the method of inverse probability. For example Gauss showed that under a local uniform prior for $\theta$, the mean of the normal distribution, the posterior distribution of $\theta$ is $N(\bar{x}, \sigma^2/n)$. Gauss then called $\bar{x}$ the maximum (posterior) probability estimator of $\theta$ since the posterior distribution of $\theta$ given $x$ has its unique mode at $\bar{x}$.

Fisher (1912) justified his method of MLE using the inverse probability 
argument although Fisher (1922) later adds a disclaimer “I must indeed plead 
guilty in my original statement of the method of MLE to having based my 
argument on inverse probability”. The change in the view point of Fisher 
was due to criticisms of Boole (1958), Venn (1866, 1876) and the fact that 
Bayes-Laplace method of inverse probability required specification of prior 
distribution which introduces a subjective element in scientific inference. 

However Bayes-Laplace theory is undoubtedly attractive and fits well into the Descartes-Bacon philosophy of rational empiricism which assumes that one can learn from empirical data obtained by careful experimentation. Even though Fisher later became a critic of inverse probability he gives due credit (Fisher, 1956) to Bayes for systematically tackling the problem of inductive inference for the first time, when the logicians and philosophers of those times regarded inductive inference as not possible. Fisher (1930) tried to provide a solution to the problem of inductive inference based on fiducial probability distributions and confidence intervals without assuming $\theta$ to be a random variable with known prior distribution but assuming some well defined structure on the parameter space $\Omega$. We will not discuss the issues further here but emphasize that the last word has not been said and perhaps will never be said in regard to the solution of the problem of inductive inference, of which parametric inference is an important special case.

Chapter 10, dealing with the confidence interval estimation, will show 
that similar problems do crop up there also. 



10 

Interval Estimation 



10.1 Background

Let $(x_1, \ldots, x_n)$ be a random sample on $X$ with pdf from $\{f(x, \theta), \theta \in \Omega \subset R_1\}$. In the problem of estimation our search was for an estimator $T(x)$ of $\theta$ or $\psi(\theta)$ having some desirable properties such as MVUness or CANness. In the problem of testing of hypotheses $H_0 : \theta \in \Omega_{H_0}$ against $H_1 : \theta \in \Omega_{H_1}$, we searched for a test procedure defined by a test function $\varphi(x)$ with type I error bounded above by a specified level $\alpha$ and having some desirable properties such as UMPness, Unbiasedness, Similarity, Consistency etc. In the problem of interval estimation of $\theta$ we are searching for a random interval $(T_1(x), T_2(x))$ such that for any $\theta \in \Omega$

$$P[T_1(x) < \theta < T_2(x) \mid \theta] \geq 1 - \gamma, \quad \forall\, \theta \in \Omega \qquad (10.1.1)$$

where $\gamma$ is a specified constant. The constant on the RHS is called the level of the confidence interval (CI). If $P[T_1(x) < \theta < T_2(x) \mid \theta]$ does not depend on $\theta$ then this probability is called the size of the CI given by $(T_1(x), T_2(x))$. We illustrate this by an example.


Example 10.1.1 Let $(x_1, \ldots, x_n)$ be a random sample from $N(\theta, 1)$ with $\theta \in R_1$ and consider $T_1(x) = \bar{x} - a$ and $T_2(x) = \bar{x} + a$ where $a > 0$. Then

$$P[T_1(x) < \theta < T_2(x) \mid \theta] = P[-a < \bar{X} - \theta < a \mid \theta]$$

Under each $\theta$, $\bar{X} - \theta \sim N(0, 1/n)$ and therefore

$$P[\bar{X} - a < \theta < \bar{X} + a \mid \theta] = \Phi[\sqrt{n}\,a] - \Phi[-\sqrt{n}\,a] = 2\Phi[\sqrt{n}\,a] - 1.$$

Hence $(\bar{x} - a, \bar{x} + a)$, for a given $a > 0$, provides a CI of size $2\Phi(\sqrt{n}\,a) - 1$. Observe that the size of the CI $(\bar{x} - a, \bar{x} + a)$, as $a$ takes values over $(0, \infty)$, ranges over $(0, 1)$ for any fixed $n$. Therefore given $\gamma$, we can determine $a_\gamma$ such that the size of the CI is $1 - \gamma$ and the corresponding $a_\gamma$ is given by

$$a_\gamma = \frac{1}{\sqrt{n}}\, \Phi^{-1}(1 - \gamma/2) = \frac{\xi_{1-\gamma/2}}{\sqrt{n}} \qquad (10.1.2)$$

Any value of $a > a_\gamma$ will provide a CI of level $1 - \gamma$ and size larger than $1 - \gamma$, but the larger the value of $a$, the larger will be the length of the CI, which is $2a$. In the limit we have size $\to 1$ as $a \to \infty$ and we have a CI of size one but of infinite length. On the other hand if $a < a_\gamma$ then $(\bar{x} - a, \bar{x} + a)$ will not be a CI of level $1 - \gamma$. This shows that our object is to find a CI of given level $1 - \gamma$ which is such that its expected length $= E(T_2(x) - T_1(x))$ is as small as possible.

Suppose we consider CIs of the type $T_1(x) = \bar{x} + c$ and $T_2(x) = \bar{x} + d$ with $-\infty < c < d < \infty$. Then the expected length is $(d - c)$ and $P[\bar{X} + c \leq \theta \leq \bar{X} + d \mid \theta] = \Phi(\sqrt{n}\,d) - \Phi(\sqrt{n}\,c)$.

We want to determine $c$ and $d$ such that $(d - c)$ is minimized subject to the condition that $\Phi(\sqrt{n}\,d) - \Phi(\sqrt{n}\,c) = 1 - \gamma$.

Using Lagrange's method of undetermined multipliers the equations determining $c$ and $d$ are

$$1 + \lambda\, \phi(\sqrt{n}\,d)\sqrt{n} = 0 = 1 + \lambda\, \phi(\sqrt{n}\,c)\sqrt{n} \qquad (10.1.3)$$

This immediately gives $\phi(\sqrt{n}\,d) = \phi(\sqrt{n}\,c)$. As $c < d$, in view of the symmetry of $\phi(x)$ around zero we must have $c = -d$. Hence we select $d$ such that $\Phi(\sqrt{n}\,d) - \Phi(-\sqrt{n}\,d) = 1 - \gamma$ or $d = \xi_{1-\gamma/2}/\sqrt{n}$, which is the $a_\gamma$ of (10.1.2).
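The conclusion of Example 10.1.1 can be checked numerically. This is a minimal sketch using the standard library `statistics.NormalDist`; the helper names `half_width`, `size_of_interval` and `right_end_for` are hypothetical and only illustrate that a non-symmetric interval of the same size is longer.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def half_width(n, gamma):
    """a_gamma of (10.1.2): xi_{1 - gamma/2} / sqrt(n)."""
    return STD_NORMAL.inv_cdf(1 - gamma / 2) / math.sqrt(n)

def size_of_interval(c, d, n):
    """P[Xbar + c <= theta <= Xbar + d] = Phi(sqrt(n) d) - Phi(sqrt(n) c)."""
    r = math.sqrt(n)
    return STD_NORMAL.cdf(r * d) - STD_NORMAL.cdf(r * c)

def right_end_for(c, n, gamma):
    """For a given left shift c, the d giving size 1 - gamma (None if impossible)."""
    r = math.sqrt(n)
    p = STD_NORMAL.cdf(r * c) + (1 - gamma)
    if p >= 1:
        return None
    return STD_NORMAL.inv_cdf(p) / r

n, gamma = 25, 0.05
a = half_width(n, gamma)
print("symmetric length:", 2 * a, "size:", size_of_interval(-a, a, n))
c = -0.5
d = right_end_for(c, n, gamma)
print("shifted length:", d - c, "size:", size_of_interval(c, d, n))
```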



Example 10.1.2 Let $(X_1, X_2)$ be a random sample of size two from an exponential distribution with mean $\theta$. Then as $x_1 > 0$ and $x_2 > 0$ we have $x_1 < x_1 + x_2$ and we can take $T_1(x) = x_1$ and $T_2(x) = x_1 + x_2$. If $P(X_1 < \theta < X_1 + X_2)$ does not depend on $\theta$ and is constant, say $1 - \gamma$, then $(x_1, x_1 + x_2)$ would be a CI for $\theta$ of size $1 - \gamma$. Now let

$$1 - \gamma(\theta) = \iint_{x_1 < \theta < x_1 + x_2} \frac{1}{\theta^2}\, e^{-(x_1 + x_2)/\theta}\, dx_1\, dx_2 \qquad (10.1.3)$$

Making the transformation $u_1 = \frac{x_1}{\theta}$, $u_2 = \frac{x_2}{\theta}$ we have

$$1 - \gamma(\theta) = \iint_{u_1 < 1 < u_1 + u_2} e^{-(u_1 + u_2)}\, du_1\, du_2$$

which is independent of $\theta$. One can easily check that $1 - \gamma(\theta) = e^{-1}$. Hence $(x_1, x_1 + x_2)$ is a CI of size $e^{-1}$ and level less than or equal to $e^{-1}$.

Note that $u_1 = \frac{x_1}{\theta}$ or $u_2 = \frac{x_2}{\theta}$ are not statistics as these are functions of observations and the parameter. Fisher (1935) called a function $T(x, \theta)$ of observations and the parameter a 'pivot' or pivotal quantity if $P[T(x, \theta) \leq a \mid \theta] = G(a, \theta)$ does not depend on $\theta$ for any $\theta \in \Omega$ and $a \in R_1$. Note that $(\bar{x} - \theta)$ in Example 10.1.1, being $N(0, 1/n)$, is a pivot. Similarly, in Example 10.1.2, $u_1 = \frac{x_1}{\theta}$, $u_2 = \frac{x_2}{\theta}$ are pivots as these are independent standard exponentials, and $u_3 = \frac{x_1}{\theta} + \frac{x_2}{\theta}$ under each $\theta$ has the $G(2, 1)$ distribution with pdf $g_2(u) = u e^{-u}$, $u \geq 0$, and therefore is a pivotal quantity which can be used to construct a CI.


Example 10.1.2 (contd.). To illustrate the use of the pivotal quantity $u_3$ to construct a CI for $\theta$, observe that for any $a, b$, $0 < a < b$, we have that $a < u_3 < b$ is equivalent to $\frac{x_1 + x_2}{b} < \theta < \frac{x_1 + x_2}{a}$. We therefore take $T_1(x) = \frac{x_1 + x_2}{b}$ and $T_2(x) = \frac{x_1 + x_2}{a}$. Then $(T_1(x), T_2(x))$ is a CI for $\theta$ of size $1 - \gamma = \int_a^b g_2(u_3)\, du_3 = G_2(b) - G_2(a)$. The length of a CI of the above type is $(x_1 + x_2)\left(\frac{1}{a} - \frac{1}{b}\right)$, which is a r.v. Hence we try to determine $(a, b)$ such that $G_2(b) - G_2(a) = 1 - \gamma$ and $(x_1 + x_2)\left(\frac{1}{a} - \frac{1}{b}\right)$ is as small as possible. However as $(x_1 + x_2)$ is always positive, this amounts to minimizing $\left(\frac{1}{a} - \frac{1}{b}\right)$ subject to the constraint that $0 < a < b$ and $G_2(b) - G_2(a) = 1 - \gamma$. Using Lagrange's method, the equations defining $a, b$ are given by $b^3 e^{-b} = a^3 e^{-a}$ and $G_2(b) - G_2(a) = 1 - \gamma$. If instead of a sample of size two we have a sample of size $n$ then we search for a CI for $\theta$ of size $1 - \gamma$ of the type $\left(\frac{\sum x_i}{b}, \frac{\sum x_i}{a}\right)$ such that $G_n(b) - G_n(a) = 1 - \gamma$.

One can show that the shortest length CI in the above class of CIs has $a$ and $b$ defined by $a^{n+1} e^{-a} = b^{n+1} e^{-b}$, which can be determined using tables given by Tate and Klett (1959) for standard values of $1 - \gamma = .90, .95$, etc. or by numerical methods.
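As one possible "numerical method", the pair of equations $a^{n+1} e^{-a} = b^{n+1} e^{-b}$ and $G_n(b) - G_n(a) = 1 - \gamma$ can be solved by nested bisection. The sketch below assumes integer $n$, so that the gamma d.f. has a closed form; the function names are hypothetical and the approach is only one of several that would work.

```python
import math

def gamma_cdf(u, n):
    """d.f. of the G(n, 1) distribution for integer n >= 1."""
    term, total = 1.0, 0.0
    for k in range(n):
        total += term              # adds u^k / k!
        term *= u / (k + 1)
    return 1.0 - math.exp(-u) * total

def log_f(u, n):
    """log of u^(n+1) e^(-u); it increases up to u = n + 1, then decreases."""
    return (n + 1) * math.log(u) - u

def partner_b(a, n):
    """The b > n + 1 with b^(n+1) e^(-b) = a^(n+1) e^(-a), for 0 < a < n + 1."""
    target = log_f(a, n)
    lo, hi = n + 1.0, n + 2.0
    while log_f(hi, n) > target:   # expand until the root is bracketed
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if log_f(mid, n) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def shortest_ab(n, coverage):
    """(a, b) satisfying both equations; coverage = 1 - gamma."""
    lo, hi = 1e-9, n + 1.0
    for _ in range(200):
        a = 0.5 * (lo + hi)
        b = partner_b(a, n)
        # Coverage shrinks as a moves up towards n + 1.
        if gamma_cdf(b, n) - gamma_cdf(a, n) > coverage:
            lo = a
        else:
            hi = a
    a = 0.5 * (lo + hi)
    return a, partner_b(a, n)

print(shortest_ab(2, 0.95))
print(shortest_ab(10, 0.90))
```

The interval for $\theta$ is then $(\sum x_i / b, \sum x_i / a)$ with the returned pair.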


10.2 Shortest Expected Length CI 

The general problem of determining the shortest expected length CI of size $1 - \gamma$ can be posed as follows. Let $C_\gamma$ denote the class of CIs of level $1 - \gamma$. Determine $(T_1^*(x), T_2^*(x)) \in C_\gamma$ such that

$$E_\theta(T_2^*(x) - T_1^*(x)) \leq E_\theta(T_2(x) - T_1(x)), \quad \forall\, \theta \in \Omega$$

for any $(T_1(x), T_2(x)) \in C_\gamma$. This is analogous to determining $T^* \in U_\psi$ such that $M_T - M_{T^*}$ is nnd, $\forall\, \theta \in \Omega$, $\forall\, T \in U_\psi$, or determining $\varphi^* \in D_\alpha$ such that $\beta_{\varphi^*}(\theta) \geq \beta_\varphi(\theta)$, $\forall\, \theta \in \Omega_{H_1}$, $\forall\, \varphi \in D_\alpha$. We have already observed that the optimal estimators and tests exist in some specific models only. The problem of obtaining an optimum CI of given level $1 - \gamma$ is more complex and we will restrict ourselves to the CIs obtained by using a pivotal quantity $u(x, \theta)$ with pdf $h(u)$ and d.f. $H(u)$ under each $\theta \in \Omega$. We therefore consider an interval $(a, b)$ such that

$$P(a \leq u(x, \theta) \leq b \mid \theta) = H(b) - H(a) = 1 - \gamma \qquad (10.2.1)$$

Select $u(x, \theta)$ such that the inequality $a \leq u(x, \theta) \leq b$ can be inverted to obtain $T_1(x) \leq \theta \leq T_2(x)$ for each $x$ and every $\theta \in \Omega$. Then $(T_1(x), T_2(x))$ is a CI of size $1 - \gamma$. The upper limit $T_2(x)$ and the lower limit $T_1(x)$ of the CI would depend on $(a, b)$. We then select $(a, b)$ satisfying (10.2.1) such that the expected length of the corresponding CI is shortest. Such an interval is called the shortest expected length CI (SELCI). We will usually consider the pivotal quantity $u(x, \theta)$ based on the minimal sufficient statistic for the family $\{L(x, \theta), \theta \in \Omega\}$.

Note that in Example 10.1.1, $\bar{X} - \theta$ is a pivotal quantity based on the minimal sufficient statistic $\bar{X}$ and in Example 10.1.2, $\sum X_i/\theta$ is a pivotal quantity based on the minimal sufficient statistic $\sum X_i$. To illustrate the technique further we consider some more examples.

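Example 10.2.1 Let $(X_1, \ldots, X_n)$ be a random sample from $U(0, \theta)$ so that $X_{(n)}$ is the minimal sufficient statistic with pdf

$$g(x_{(n)}, \theta) = \frac{n\, x_{(n)}^{n-1}}{\theta^n}, \quad 0 < x_{(n)} < \theta.$$

An obvious choice for a pivotal quantity $u(x_{(n)}, \theta)$ is $\frac{x_{(n)}}{\theta}$ with its pdf $h(u) = n u^{n-1}$, $0 < u < 1$.

Now consider

$$\{a < u(x_{(n)}, \theta) < b\} = \left\{a < \frac{x_{(n)}}{\theta} < b\right\} = \left\{\frac{x_{(n)}}{b} < \theta < \frac{x_{(n)}}{a}\right\}.$$

We have to determine $(a, b)$ such that $P(a < u(x, \theta) < b \mid \theta) = b^n - a^n = 1 - \gamma$ and $E\left[x_{(n)}\left(\frac{1}{a} - \frac{1}{b}\right)\right] = \frac{n\theta}{n+1}\left(\frac{1}{a} - \frac{1}{b}\right)$ is minimum, or $Q = \left(\frac{1}{a} - \frac{1}{b}\right)$ is minimum.

As $b^n - a^n = 1 - \gamma$, treating $a$ as a function of $b$, we have

$$n b^{n-1} - n a^{n-1} \frac{da}{db} = 0, \quad \text{i.e.} \quad \frac{da}{db} = \frac{b^{n-1}}{a^{n-1}}.$$

Therefore

$$\frac{dQ}{db} = -\frac{1}{a^2}\frac{da}{db} + \frac{1}{b^2} = -\frac{1}{a^2}\cdot\frac{b^{n-1}}{a^{n-1}} + \frac{1}{b^2} = \frac{a^{n+1} - b^{n+1}}{b^2 a^{n+1}}.$$

As $0 < a < b \leq 1$ we have $\frac{dQ}{db} < 0$ and the minimum of $Q$ occurs at the maximum possible value of $b$, namely $b = 1$. This gives $a = \gamma^{1/n}$ and the SELCI is given by $\left(x_{(n)}, \frac{x_{(n)}}{\gamma^{1/n}}\right)$.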
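The monotonicity argument of Example 10.2.1 can be seen numerically. The sketch below is illustrative only; `uniform_selci`, `a_for` and `q` are hypothetical names, and the particular values of $n$, $\gamma$ and $b$ are arbitrary choices.

```python
def uniform_selci(x_max, n, gamma):
    """SELCI (x_(n), x_(n) / gamma^(1/n)) for theta in U(0, theta)."""
    return x_max, x_max / gamma ** (1.0 / n)

def a_for(b, n, gamma):
    """The a solving b^n - a^n = 1 - gamma (requires b^n > 1 - gamma)."""
    return (b ** n - (1 - gamma)) ** (1.0 / n)

def q(a, b):
    """Q = 1/a - 1/b, proportional to the expected length of the CI."""
    return 1.0 / a - 1.0 / b

n, gamma = 5, 0.05
for b in (0.995, 0.999, 1.0):
    a = a_for(b, n, gamma)
    print(b, a, q(a, b))           # Q decreases as b increases to 1
print(uniform_selci(2.0, n, gamma))
```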


Example 10.2.2 Consider a Pareto distribution with pdf $f(x, \theta) = \frac{\theta}{x^{\theta+1}}$, $x \geq 1$, $\theta > 0$. Then $Y = \log X$ is exponential with mean $\frac{1}{\theta}$ and $\sum Y_i$ is the minimal sufficient statistic, having a gamma distribution with shape $n$ and scale $\frac{1}{\theta}$. An obvious choice of pivotal quantity is $u(y, \theta) = \theta \sum y_i$ with $h_n(u) = \frac{e^{-u} u^{n-1}}{\Gamma(n)}$, $u \geq 0$.

Consider $\{a < u(y, \theta) < b\} = \left\{\frac{a}{\sum y_i} < \theta < \frac{b}{\sum y_i}\right\}$ so that we have to select $(a, b)$ such that $H_n(b) - H_n(a) = 1 - \gamma$ and $E\left(\frac{1}{\sum Y_i}\right)(b - a)$ is minimum. Note that $E\left(\frac{1}{\sum Y_i}\right)$ exists only when $n \geq 2$. However, since $\sum y_i > 0$ for any $n \geq 1$, we select $(a, b)$ such that $Q = (b - a)$ is minimum subject to the condition that $H_n(b) - H_n(a) = 1 - \gamma$. Again treating $a$ as a function of $b$, we have $h_n(b) - h_n(a)\frac{da}{db} = 0$. Therefore we must select $(a, b)$ such that $e^{-b} b^{n-1} = e^{-a} a^{n-1}$ and $H_n(b) - H_n(a) = 1 - \gamma$.

We note that if $(T_1(x), T_2(x))$ is a CI of size $1 - \gamma$ for $\theta$ and if $\psi$ is an increasing function of $\theta$ with $\frac{d\psi}{d\theta} > 0$ then $(\psi(T_1(x)), \psi(T_2(x)))$ is a CI of size $1 - \gamma$ for $\psi(\theta)$. If $\psi$ is decreasing, i.e. $\frac{d\psi}{d\theta} < 0$, then $(\psi(T_2(x)), \psi(T_1(x)))$ is a CI of size $1 - \gamma$ for $\psi(\theta)$. This follows from the fact that in case $\psi$ is increasing

$$\{x \mid T_1(x) < \theta < T_2(x)\} = \{x \mid \psi(T_1(x)) < \psi(\theta) < \psi(T_2(x))\} \qquad (10.2.2)$$

and if $\psi$ is decreasing the RHS of (10.2.2) becomes $\{x \mid \psi(T_2(x)) < \psi(\theta) < \psi(T_1(x))\}$. However this CI may not be the SELCI for $\psi(\theta)$ even though $(T_1(x), T_2(x))$ is the SELCI for $\theta$.

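To illustrate consider Example 10.1.2, where we have shown that $\left(\frac{\sum x_i}{b}, \frac{\sum x_i}{a}\right)$ is a CI for $\theta$ of size $1 - \gamma = G_n(b) - G_n(a)$ for any $(a, b)$ and is the SELCI of size $1 - \gamma$ if $a$ and $b$ satisfy $e^{-a} a^{n+1} = e^{-b} b^{n+1}$. Consider $\psi(\theta) = \frac{1}{\theta}$, the failure rate of the exponential distribution. Then as $\frac{d\psi}{d\theta} = -\frac{1}{\theta^2} < 0$ we have $\psi$ decreasing and $\left(\frac{a}{\sum x_i}, \frac{b}{\sum x_i}\right)$ is a CI of size $1 - \gamma$ for $\psi(\theta)$.

However the SELCI for $\psi(\theta)$ is not the same as that for $\theta$ since for $\psi(\theta) = \frac{1}{\theta}$, the constants $a, b$ must satisfy $G_n(b) - G_n(a) = 1 - \gamma$ and $e^{-b} b^{n-1} = e^{-a} a^{n-1}$.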

Next suppose that $\theta$ is vector valued, $\theta = (\theta_1, \ldots, \theta_m)$, and we want to obtain a CI for say $\theta_1$. If we can find a pivotal quantity $u(x, \theta_1)$ with pdf $h(u)$ and d.f. $H(u)$ which does not depend on $\theta$ then using the above technique we can obtain the SELCI for $\theta_1$ of size $1 - \gamma$. For example for a random sample of size $n$ from $N(\mu, \sigma^2)$, as seen earlier $(\bar{X}, S^2)'$ is minimal sufficient for $(\mu, \sigma^2)'$. To obtain the SELCI for $\mu$ we can consider $u(x, \mu) = \sqrt{n}(\bar{X} - \mu)\Big/\left\{\frac{S^2}{n-1}\right\}^{1/2}$ as a pivot with pdf given by that of a Student's $t$ with $(n - 1)$ d.f. One can show that $\bar{x} \pm t_{n-1,\,1-\gamma/2} \left\{\frac{S^2}{n(n-1)}\right\}^{1/2}$ is the SELCI for $\mu$ of size $1 - \gamma$. Similarly a pivotal quantity for $\sigma^2$ is given by $u(x, \sigma^2) = \frac{S^2}{\sigma^2}$ which is $\chi^2_{n-1}$, and the SELCI for $\sigma^2$ of size $1 - \gamma$ is given by $\left(\frac{S^2}{b}, \frac{S^2}{a}\right)$ where $a$ and $b$ are determined by $H_{n-1}(b) - H_{n-1}(a) = 1 - \gamma$ and $a^2 h_{n-1}(a) = b^2 h_{n-1}(b)$.

For details see Guenther (1969).

This section restricts attention to pivotal quantities $u(x, \theta)$ which depend on the observations through the sufficient statistic $T$. This is a consequence of the crucial role that the minimal sufficient statistic $T$ plays in MVU estimation. Recall that the Rao-Blackwell Theorem allows us to restrict our search for the MVUE within the class of unbiased estimators of $\psi(\theta)$ which are functions of the minimal sufficient statistic $T$. Similarly for any testing problem $\theta \in \Omega_{H_0}$ vs $\theta \in \Omega_{H_1}$, for any test function $\varphi(x)$ there exists a test function $E[\varphi(x) \mid t] = \varphi^*(t)$ such that the power functions of $\varphi$ and $\varphi^*$ are identical $\forall\, \theta \in \Omega$. Therefore we can restrict our attention to test functions depending on the observations only through the minimal sufficient statistic $T$. As the following example shows this is not necessarily true for the problem of CIs.

Example 10.2.3 We have already seen in Example 10.1.2 that for a random sample of size two from the exponential distribution with mean $\theta$, $T_1(x) = x_1$ and $T_2(x) = x_1 + x_2$ provide us a CI of size $e^{-1}$. The expected length of this CI is $\theta$. Now consider $E(x_1 \mid x_1 + x_2) = \frac{x_1 + x_2}{2}$ to obtain the CI $\left(\frac{x_1 + x_2}{2}, x_1 + x_2\right)$. The expected length of this CI, obtained from $(T_1(x), T_2(x))$ by conditioning w.r.t. the minimal sufficient statistic $x_1 + x_2$, is also $\theta$. Now

$$P\left[\frac{X_1 + X_2}{2} < \theta < X_1 + X_2 \,\Big|\, \theta\right] = P\left(\frac{u}{2} < 1 < u\right)$$

where $u = (X_1 + X_2)/\theta$ has pdf $h(u) = u e^{-u}$, $u \geq 0$.

Thus, the size of the CI $\left(\frac{x_1 + x_2}{2}, x_1 + x_2\right)$ is given by $\int_1^2 u e^{-u}\, du = 2e^{-1} - 3e^{-2}$, which is less than $e^{-1}$, the size of the CI given by $(x_1, x_1 + x_2)$. Thus the CI obtained by conditioning w.r.t. the minimal sufficient statistic has the same expected length but smaller size. However, as the next example shows, we can have a situation where conditioning w.r.t. the minimal sufficient statistic increases the size for the same length.

Example 10.2.4 Let $(x_1, \ldots, x_n)$ be a sample of size $n$ from $N(\theta, 1)$ and consider the CI $(x_1 - a, x_1 + a)$, $a > 0$, having length $2a$ and size $\Phi(a) - \Phi(-a) = 2\Phi(a) - 1$. Now $E(x_1 \mid \bar{x}) = \bar{x}$ and thus conditioning w.r.t. $\bar{x}$ leads to the CI given by $(\bar{x} - a, \bar{x} + a)$, which has the same length $2a$ but its size is $2\Phi(\sqrt{n}\,a) - 1$, which is much larger than that of $(x_1 - a, x_1 + a)$.
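The sizes quoted in Examples 10.2.3 and 10.2.4 can be checked by simulation. The following Monte Carlo sketch is an illustration only: the value $\theta = 2$, the seed, the number of replications, and the choices $n = 4$, $a = 0.5$ are arbitrary assumptions, and `coverage` is a hypothetical helper.

```python
import math
import random

def coverage(interval, draw_sample, theta, reps, rng):
    """Fraction of simulated samples whose interval contains theta."""
    hits = 0
    for _ in range(reps):
        lo, hi = interval(draw_sample(rng))
        hits += lo < theta < hi
    return hits / reps

theta = 2.0
rng = random.Random(12345)
reps = 100_000

# Example 10.2.3: two exponential observations with mean theta.
draw_exp = lambda r: (r.expovariate(1 / theta), r.expovariate(1 / theta))
size_raw = coverage(lambda x: (x[0], x[0] + x[1]), draw_exp, theta, reps, rng)
size_cond = coverage(lambda x: ((x[0] + x[1]) / 2, x[0] + x[1]), draw_exp, theta, reps, rng)

# Example 10.2.4: n normal observations with mean theta and unit variance.
n, a = 4, 0.5
draw_norm = lambda r: [r.gauss(theta, 1.0) for _ in range(n)]
size_x1 = coverage(lambda x: (x[0] - a, x[0] + a), draw_norm, theta, reps, rng)
size_mean = coverage(lambda x: (sum(x) / n - a, sum(x) / n + a), draw_norm, theta, reps, rng)

print(size_raw, math.exp(-1))
print(size_cond, 2 * math.exp(-1) - 3 * math.exp(-2))
print(size_x1, size_mean)
```

In the exponential case conditioning lowers the estimated size, while in the normal case it raises it, in line with the two examples.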

The choice of a pivot for $\theta$ based on the minimal sufficient statistic $T(x)$ is thus somewhat arbitrary. Further, in the examples considered above the model was such that it was easy to determine a pivotal quantity. In general it is not very clear as to how to construct a pivotal quantity even in as simple a case as a Bernoulli series of trials. This being a first course we will not pursue the problem any further, but consider in the next section how this difficulty can be avoided by using a pivotal quantity for $\theta$ based on a CAN estimator of $\theta$ so that we can obtain the SELCI in large samples at least.

10.2.1 (1) Refer Example 10 . 2 . 3 . Show that the Cl of size e~ l obtained by 

conditioning on (*, + * 2 ) is not the SELCI within the CIs of the type [ X{ + X - \ *'■ . 

q * S ^ Ven ^ a “ 1 - 19 , b s 4.45 which has much smaller expected length than 

( ) btain SELCI of size 1 - yfor 0 when we have a random sample of size n from 
Ax, 0 ) = -X exp j- iiij, xe R h 9 > 0 . 

(3 l Lt i x n) be a random sample from the pdf A*, /i) = exp {- (x - ^)), x> /j, 

^ thC P1VOt U(X{1) ’ ^ = " (x C) -/») obtain SELCI for p of size 1 - y. 

( 4 ) Consider two parameter exponential distribution with pdf 

fix, fj, cr) = — exp j— — — j, x<n,fi e R u <y> 0 . 

n 

As seen earlier (X (1) , E (X (i) -X (n y is minimal sufficient for (jj, a)'. Obtain SELCI for 

, . . , n(x ( |) - n) n 

fj based on w,(a (1) ,/0 = - -and that for abased on E (jc ( i) -x (1) )/a. 

2 (*(,) - x (I) ) 2 

2 

10.3 Large Sample Confidence Intervals 


Let $T$ be a CAN estimator of $\theta$ so that $T \sim AN\left(\theta, \frac{\sigma_T^2(\theta)}{n}\right)$. One can construct an asymptotic pivotal quantity $u(T, \theta) = \frac{\sqrt{n}(T - \theta)}{\sigma_T(\theta)}$ for which $P[-a < u < a] \approx 2\Phi(a) - 1$, $\forall\, \theta \in \Omega$. Now, the inequality $-a < \frac{\sqrt{n}(T - \theta)}{\sigma_T(\theta)} < a$ is equivalent to

$$\theta - \frac{\sigma_T(\theta)}{\sqrt{n}}\, a < T < \theta + \frac{\sigma_T(\theta)}{\sqrt{n}}\, a \qquad (10.3.1)$$

If these inequalities can be inverted and are equivalent to $T_1 \leq \theta \leq T_2$ then $(T_1, T_2)$ would be called an asymptotic CI (ACI) of size $2\Phi(a) - 1$. Taking $a = \xi_{1-\gamma/2}$ we have an ACI of size $1 - \gamma$ and this ACI would have the shortest expected length among the class of ACIs of the type $(T + c, T + d)$ based on the asymptotic pivotal quantity $u(T, \theta)$ and of size $1 - \gamma$. Generally one should choose a CAN estimator $T$ with $\sigma_T^2(\theta)$ as small as possible. Under the usual regularity conditions this leads to the MLE $\hat{\theta}$ and $u(\hat{\theta}, \theta) = \sqrt{nI(\theta)}\,(\hat{\theta} - \theta)$.

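Example 10.3.1 Let $(X_1, X_2, \ldots, X_n)$ be a random sample from the Cauchy distribution with location $\theta$. Then $I(\theta) = \frac{1}{2}$ and $u(\hat{\theta}, \theta) = \sqrt{n/2}\,(\hat{\theta} - \theta)$ and

$$\{-a < \sqrt{n/2}\,(\hat{\theta} - \theta) < a\} = \{\hat{\theta} - \sqrt{2/n}\; a < \theta < \hat{\theta} + \sqrt{2/n}\; a\}$$

Taking $a = \xi_{1-\gamma/2}$, $\hat{\theta} \pm \sqrt{2/n}\; \xi_{1-\gamma/2}$ is an ACI for $\theta$ of size $1 - \gamma$.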

Example 10.3.2 Let $(X_1, \ldots, X_n)$ be a random sample from the Laplace distribution with pdf $f(x, \theta) = \frac{1}{2} \exp\{-|x - \theta|\}$, $x \in R_1$, $\theta \in R_1$. Then the sample mean $\bar{X}$ and the sample median $M_n$ are both CAN for $\theta$ with asymptotic variances $\frac{2}{n}$ and $\frac{1}{n}$, respectively. Therefore $(\bar{x} \pm \sqrt{2/n}\; \xi_{1-\gamma/2})$ and $(M_n \pm \xi_{1-\gamma/2}/\sqrt{n})$ are both ACIs of size $1 - \gamma$. Note that the ACI based on $M_n$, the sample median, which is also the MLE of $\theta$, has shorter length than that based on $\bar{X}$.

In both the above examples the asymptotic variances $\sigma_T^2(\theta)/n$ were independent of $\theta$ and inversion of the inequality $-a < \sqrt{n}(T - \theta)/\sigma_T(\theta) < a$ was very easy. We now consider situations in which $\sigma_T(\theta)$ depends on $\theta$.


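Example 10.3.3 Let $(X_1, X_2, \ldots, X_n)$ be a random sample from an exponential distribution with mean $\theta$. Then $\hat{\theta} = \bar{X} \sim AN(\theta, \theta^2/n)$ so that $AV(\hat{\theta})$ depends on $\theta$. Let $u(\bar{x}, \theta) = \frac{\sqrt{n}(\bar{x} - \theta)}{\theta}$ be the pivotal quantity and consider the inequalities

$$-a < \sqrt{n}\, \frac{(\bar{x} - \theta)}{\theta} < a \qquad (10.3.2)$$

A straightforward calculation leads to the inequalities

$$\theta - \frac{a}{\sqrt{n}}\,\theta < \bar{x} < \theta + \frac{a}{\sqrt{n}}\,\theta$$

which lead to the ACI

$$\frac{\bar{x}}{1 + a/\sqrt{n}} < \theta < \frac{\bar{x}}{1 - a/\sqrt{n}} \qquad (10.3.3)$$

Taking $a = \xi_{1-\gamma/2}$ we have an ACI of size $1 - \gamma$.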


We now consider an example where $\sigma_T^2(\theta)$ depends on $\theta$ in such a way that inversion of the inequalities in (10.3.1) is not as easy as in the above case.

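Example 10.3.4 Let $(X_1, X_2, \ldots, X_n)$ be a random sample from the Poisson distribution with parameter $\lambda$. Then, as seen before, $\bar{x} \sim AN(\lambda, \lambda/n)$ and we take $u(\bar{x}, \lambda) = \frac{\sqrt{n}(\bar{x} - \lambda)}{\sqrt{\lambda}}$ as the pivot. Let $a = \xi_{1-\gamma/2}$; then the ACI of size $1 - \gamma$ is given by inverting $-a < u(\bar{x}, \lambda) < a$, which is equivalent to $\frac{n(\bar{x} - \lambda)^2}{\lambda} < a^2$ or

$$\bar{x}^2 - \lambda\left(2\bar{x} + \frac{a^2}{n}\right) + \lambda^2 < 0 \qquad (10.3.4)$$

The LHS of (10.3.4) is a quadratic in $\lambda$ and can be factored as $[\lambda - \lambda_1(\bar{x})][\lambda - \lambda_2(\bar{x})]$ where $\lambda_1(\bar{x}) < \lambda_2(\bar{x})$ are the roots of the equation $\lambda^2 - \lambda(2\bar{x} + a^2/n) + \bar{x}^2 = 0$. These are given by

$$\lambda_1(\bar{x}) = \frac{(2\bar{x} + a^2/n) - \{(2\bar{x} + a^2/n)^2 - 4\bar{x}^2\}^{1/2}}{2}, \qquad \lambda_2(\bar{x}) = \frac{(2\bar{x} + a^2/n) + \{(2\bar{x} + a^2/n)^2 - 4\bar{x}^2\}^{1/2}}{2} \qquad (10.3.5)$$

Note that $(2\bar{x} + a^2/n)^2 - 4\bar{x}^2 = 4\bar{x}\,\frac{a^2}{n} + \frac{a^4}{n^2} > 0$ for all $\bar{x}$ since $\bar{x} \geq 0$ w.p. 1 under each $\lambda > 0$ and therefore $\lambda_1(\bar{x})$ and $\lambda_2(\bar{x})$ are both real. Thus $(\lambda_1(\bar{x}), \lambda_2(\bar{x}))$ is an ACI of size $1 - \gamma$ for $\lambda$.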

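It can be shown that for a random sample of size $n$ on $b(1, \theta)$ the ACI of size $1 - \gamma$ for $\theta$ is given by $(\theta_1(\bar{x}), \theta_2(\bar{x}))$ where

$$\theta_1(\bar{x}) = \frac{(2n\bar{x} + a^2) - \{4a^2 n \bar{x}(1 - \bar{x}) + a^4\}^{1/2}}{2(a^2 + n)}, \qquad \theta_2(\bar{x}) = \frac{(2n\bar{x} + a^2) + \{4a^2 n \bar{x}(1 - \bar{x}) + a^4\}^{1/2}}{2(a^2 + n)} \qquad (10.3.6)$$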
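Formulae (10.3.5) and (10.3.6) translate directly into code. The sketch below takes $a = \xi_{1-\gamma/2}$ from the standard library `statistics.NormalDist`; the function names `poisson_aci` and `binomial_aci` and the numerical inputs are illustrative assumptions.

```python
import math
from statistics import NormalDist

def poisson_aci(xbar, n, gamma):
    """Roots (lambda_1, lambda_2) of (10.3.5) for a Poisson sample mean xbar."""
    a = NormalDist().inv_cdf(1 - gamma / 2)
    mid = 2 * xbar + a * a / n
    root = math.sqrt(mid * mid - 4 * xbar * xbar)
    return (mid - root) / 2, (mid + root) / 2

def binomial_aci(xbar, n, gamma):
    """Limits (theta_1, theta_2) of (10.3.6) for a b(1, theta) sample mean xbar."""
    a = NormalDist().inv_cdf(1 - gamma / 2)
    centre = 2 * n * xbar + a * a
    root = math.sqrt(4 * a * a * n * xbar * (1 - xbar) + a ** 4)
    denom = 2 * (a * a + n)
    return (centre - root) / denom, (centre + root) / denom

print(poisson_aci(3.0, 50, 0.05))
print(binomial_aci(0.5, 100, 0.05))
```

Both intervals stay inside the parameter space by construction, since they are obtained by solving the quadratic inequality rather than by adding and subtracting a margin.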


In the case of Poisson or binomial models the inversion of inequalities was not that complicated since $\sigma_T^2(\theta)$ was such that it led to a quadratic equation in $\theta$. However if $\sigma_T^2(\theta)$ is a complicated function of $\theta$ then the inversion of the basic inequalities would be much harder. Further, in the case where $\theta$ is vector valued, say $(\theta_1, \ldots, \theta_m)'$, and $T$ is CAN for $\theta_1$, $\sigma_T^2(\theta)$ may depend on the nuisance parameters $\theta_2, \ldots, \theta_m$ and $u(T, \theta) = \sqrt{n}(T - \theta_1)/\sigma_T(\theta)$ may not be an asymptotic pivot. We get around both of these problems by the technique of studentizing the asymptotic pivot $u(T, \theta)$, i.e. estimating $\sigma_T(\theta)$ by a consistent estimator of $\sigma_T(\theta)$. Observe that for the $N(\mu, \sigma^2)$ case with $\sigma^2$ unknown, the studentized version of $u(\bar{x}, \mu) = \frac{\sqrt{n}(\bar{x} - \mu)}{\sigma}$ is given by Student's $t$ with $\sigma$ substituted by its consistent estimator $\left\{\frac{S^2}{n-1}\right\}^{1/2}$. Note that if $\hat{\theta}$ is consistent for $\theta$ and $\sigma_T(\theta)$ is continuous then $\sigma_T(\hat{\theta}) \xrightarrow{P} \sigma_T(\theta)$. Therefore $\sqrt{n}(T - \theta_1)/\sigma_T(\hat{\theta}) \xrightarrow{L} N(0, 1)$ and is an asymptotic pivotal quantity. The inequalities

$$-a < \sqrt{n}(T - \theta_1)/\sigma_T(\hat{\theta}) < a$$

can now be inverted very easily to obtain

$$T - \frac{a}{\sqrt{n}}\, \sigma_T(\hat{\theta}) < \theta_1 < T + \frac{a}{\sqrt{n}}\, \sigma_T(\hat{\theta}) \qquad (10.3.7)$$

For $a = \xi_{1-\gamma/2}$, $T \pm \frac{\xi_{1-\gamma/2}}{\sqrt{n}}\, \sigma_T(\hat{\theta})$ gives us an ACI of size $1 - \gamma$ for $\theta_1$.

Thus the studentized asymptotic pivot helps us in proposing a reasonable solution when nuisance parameters are present as well as when the basic inequalities are difficult to invert. We illustrate these situations by suitable examples.


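Example 10.3.5 Consider the one way ANOVA model with $x_{ij} = \mu_i + \varepsilon_{ij}$, $i = 1, 2, \ldots, k$ and $j = 1, 2, \ldots, n$. Let $\{\varepsilon_{ij}\}$ be i.i.d. $N(0, \sigma^2)$ so that $\theta = (\mu_1, \ldots, \mu_k, \sigma^2)'$. Then as seen before the MLEs are given by $\hat{\mu}_i = \bar{x}_i$, $i = 1, 2, \ldots, k$ and $\hat{\sigma}^2 = \frac{1}{k}\sum_{i=1}^{k} S_i^2/n$, where $S_i^2 = \sum_j (x_{ij} - \bar{x}_i)^2$. The asymptotic variance covariance matrix is given by $\text{diag}\left(\frac{\sigma^2}{n}, \ldots, \frac{\sigma^2}{n}, \frac{2\sigma^4}{nk}\right)$. Consider a linear function $\sum l_i \mu_i$. Then $\sum l_i \bar{x}_i \sim AN\left(\sum l_i \mu_i, \sum l_i^2 \sigma^2/n\right)$. Since $\sigma$ is unknown we studentize and consider the studentized asymptotic pivot given by

$$\frac{\sqrt{n}\left(\sum l_i \bar{x}_i - \sum l_i \mu_i\right)}{\hat{\sigma}\left(\sum l_i^2\right)^{1/2}}$$

so that the ACI of size $1 - \gamma$ for $\sum l_i \mu_i$ is given by

$$\sum l_i \bar{x}_i \pm \frac{\xi_{1-\gamma/2}}{\sqrt{n}} \left(\sum l_i^2\right)^{1/2} \hat{\sigma}.$$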

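In the above set up we can drop the assumption of normality of errors and assume instead that $\{\varepsilon_{ij}\}$ are i.i.d. with $E(\varepsilon_{ij}) = 0$, $\text{Var}(\varepsilon_{ij}) = \sigma^2$. Then by the MV CLT, $\sum l_i \bar{x}_i \sim AN\left(\sum l_i \mu_i, \frac{\sigma^2}{n} \sum l_i^2\right)$ and $\hat{\sigma}^2 = \frac{1}{k}\sum_{i=1}^{k} \frac{S_i^2}{n} \xrightarrow{P} \sigma^2$ as $n \to \infty$.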

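We can further relax the assumption of the same variance within each block. Thus now $E(\varepsilon_{ij}) = 0$, $\{\varepsilon_{ij}\}$ are mutually independent but are i.i.d. only within each block with $\text{Var}(\varepsilon_{ij}) = \sigma_i^2$, $i = 1, 2, \ldots, k$. Then $(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k)'$ is CAN for $(\mu_1, \mu_2, \ldots, \mu_k)'$ with asymptotic variance covariance matrix given by $\text{diag}\left(\frac{\sigma_1^2}{n}, \ldots, \frac{\sigma_k^2}{n}\right)$. Then

$$\sum l_i \bar{x}_i \sim AN\left(\sum l_i \mu_i,\; \sum_{i=1}^{k} l_i^2 \frac{\sigma_i^2}{n}\right).$$

Now $\hat{\sigma}_i^2 = \frac{1}{n}\sum_j (x_{ij} - \bar{x}_i)^2 \xrightarrow{P} \sigma_i^2$, $i = 1, 2, \ldots, k$. Therefore the studentized asymptotic pivot for $\sum l_i \mu_i$ is

$$\left(\sum l_i \bar{x}_i - \sum l_i \mu_i\right) \Big/ \left\{\sum_{i=1}^{k} l_i^2 \frac{\hat{\sigma}_i^2}{n}\right\}^{1/2}$$

and $\sum l_i \bar{x}_i \pm \xi_{1-\gamma/2} \left\{\sum l_i^2 \frac{\hat{\sigma}_i^2}{n}\right\}^{1/2}$ is an ACI of size $1 - \gamma$ for $\sum l_i \mu_i$. In particular for $k = 2$, we have that

$$\bar{x}_1 - \bar{x}_2 \pm \xi_{1-\gamma/2} \left\{\frac{\hat{\sigma}_1^2}{n} + \frac{\hat{\sigma}_2^2}{n}\right\}^{1/2}$$

is an ACI for $\mu_1 - \mu_2$.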


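We can also relax the assumption of the same sample size within each block,
i.e. take $j = 1, 2, \ldots, n_i$, $i = 1, 2, \ldots, k$. Then assuming that each $n_i \to \infty$
we have

$$\sum l_i\bar{x}_i \sim AN\left(\sum l_i\mu_i, \sum l_i^2\frac{\sigma_i^2}{n_i}\right) \quad \text{and} \quad \sum l_i\bar{x}_i \pm \xi_{1-\gamma/2}\left\{\sum l_i^2\frac{\hat\sigma_i^2}{n_i}\right\}^{1/2}$$

is ACI for $\sum l_i\mu_i$ of size $1 - \gamma$, where $\hat\sigma_i^2 = \frac{1}{n_i}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2$.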
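As a rough illustration of the last interval, the following Python sketch computes the studentized ACI for $\sum l_i\mu_i$ when the blocks may have different variances and different sample sizes. The function name `linear_contrast_aci`, its arguments and the small data set are illustrative assumptions and are not part of the text; the variance estimates use the divisor $n_i$ as above.

```python
from math import sqrt
from statistics import NormalDist


def linear_contrast_aci(samples, coeffs, gamma=0.05):
    """Studentized ACI of size 1 - gamma for sum_i l_i * mu_i.

    samples: list of k lists, block i holding its n_i observations
    coeffs:  the k coefficients l_i
    """
    xi = NormalDist().inv_cdf(1 - gamma / 2)  # xi_{1 - gamma/2}
    estimate = 0.0
    variance = 0.0
    for block, l in zip(samples, coeffs):
        n_i = len(block)
        mean_i = sum(block) / n_i
        # sigma_i^2 hat with divisor n_i, as in the text
        var_i = sum((x - mean_i) ** 2 for x in block) / n_i
        estimate += l * mean_i
        variance += l * l * var_i / n_i
    half_width = xi * sqrt(variance)
    return estimate - half_width, estimate + half_width


# Hypothetical data: two blocks of unequal size, interval for mu_1 - mu_2.
blocks = [[5.1, 4.8, 5.6, 5.0, 5.3], [4.2, 4.9, 4.4, 4.0]]
low, high = linear_contrast_aci(blocks, [1, -1])
```

With coefficients $(1, -1)$ this reduces to the $k = 2$ interval for $\mu_1 - \mu_2$; the interval is only asymptotically valid, so block sizes as small as those in the toy data would not justify it in practice.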

When the situation is such that the parameter is real valued we construct the
asymptotic pivotal quantity $u(T, \theta) = \sqrt{n}(T - \theta)/\sigma_T(\hat\theta)$ where $\hat\theta$ is consistent
for $\theta$ and $\sigma_T(\hat\theta) \xrightarrow{P} \sigma_T(\theta)$. We could use $\hat\theta = T$ and in general we use the
MLE $\hat\theta$ and the asymptotic pivot $u(\hat\theta, \theta) = \sqrt{nI(\hat\theta)}\,(\hat\theta - \theta)$. Then ACI for
$\theta$ of size $1 - \gamma$ is given by $\hat\theta \pm \xi_{1-\gamma/2}/\sqrt{nI(\hat\theta)}$. The ACIs obtained by using
$I(\hat\theta)$ instead of $I(\theta)$ generally differ by terms $o(1/\sqrt{n})$ as can be seen from the
following examples.

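Example 10.3.3 (contd.) We now use the asymptotic pivot $\sqrt{n}(\bar{x} - \theta)/\bar{x}$ instead
of $\sqrt{n}(\bar{x} - \theta)/\theta$. Then the ACI for $\theta$ is given by

$$\bar{x}(1 - a/\sqrt{n}) < \theta < \bar{x}(1 + a/\sqrt{n}) \tag{10.3.8}$$

From (10.3.3) the lower limit of ACI is

$$\bar{x}\left\{1 + \frac{a}{\sqrt{n}}\right\}^{-1} = \bar{x}\left\{1 - \frac{a}{\sqrt{n}} + o\left(\frac{1}{\sqrt{n}}\right)\right\}.$$

Similarly the upper limit from (10.3.3) is

$$\bar{x}\left\{1 - \frac{a}{\sqrt{n}}\right\}^{-1} = \bar{x}\left\{1 + \frac{a}{\sqrt{n}} + o\left(\frac{1}{\sqrt{n}}\right)\right\}.$$

Therefore ACI given by (10.3.8) and (10.3.3) differ by $o(1/\sqrt{n})$.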
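The comparison can be checked numerically. The sketch below assumes, as the surrounding text suggests, that (10.3.3) is the interval obtained by inverting $\sqrt{n}(\bar{x} - \theta)/\theta$ for the exponential model with mean $\theta$, namely $\bar{x}/(1 + a/\sqrt{n}) < \theta < \bar{x}/(1 - a/\sqrt{n})$; the function name and arguments are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def exponential_acis(xbar, n, gamma=0.05):
    """Return the two ACIs for the exponential mean theta.

    First:  interval from inverting sqrt(n)(xbar - theta)/theta
            (assumed here to be the interval labelled (10.3.3)).
    Second: studentized interval (10.3.8).
    """
    a = NormalDist().inv_cdf(1 - gamma / 2)
    r = a / sqrt(n)
    if r >= 1:
        raise ValueError("n too small: need a / sqrt(n) < 1")
    unstudentized = (xbar / (1 + r), xbar / (1 - r))
    studentized = (xbar * (1 - r), xbar * (1 + r))
    return unstudentized, studentized


exact_pivot, stud = exponential_acis(xbar=2.0, n=100)
```

Both limits of the unstudentized interval exceed the corresponding studentized limits by $\bar{x}r^2/(1 \pm r)$ with $r = a/\sqrt{n}$, which is of order $1/n$.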

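Example 10.3.4 (contd.) Here the studentized version of the asymptotic
pivot $u(\bar{x}, \lambda) = \sqrt{n}(\bar{x} - \lambda)/\sqrt{\lambda}$ is given by $\sqrt{n}(\bar{x} - \lambda)/\sqrt{\bar{x}}$ and ACI for $\lambda$ of size
$1 - \gamma$ is given by

$$\bar{x} - \frac{a}{\sqrt{n}}\sqrt{\bar{x}} < \lambda < \bar{x} + \frac{a}{\sqrt{n}}\sqrt{\bar{x}} \tag{10.3.9}$$

where $a = \xi_{1-\gamma/2}$.

Now the lower limit of ACI based on $u(\bar{x}, \lambda)$ given in (10.3.5) is

$$\lambda_1(\bar{x}) = \frac{1}{2}\left[2\bar{x} + \frac{a^2}{n} - \left\{4\bar{x}\frac{a^2}{n} + \frac{a^4}{n^2}\right\}^{1/2}\right]
= \bar{x} + \frac{a^2}{2n} - a\left\{\frac{\bar{x}}{n}\right\}^{1/2}\left\{1 + \frac{a^2}{4\bar{x}n}\right\}^{1/2}
= \bar{x} - a\{\bar{x}/n\}^{1/2} + o(1/\sqrt{n}).$$

Similarly one can show that the upper limit $\lambda_2(\bar{x}) = \bar{x} + a(\bar{x}/n)^{1/2} + o(1/\sqrt{n})$.
We leave it to the reader to show that for the ACI for $\theta$ in the $b(1, \theta)$ model, the
ACI obtained by studentization is given by

$$\bar{x} - a\left\{\frac{\bar{x}(1 - \bar{x})}{n}\right\}^{1/2} < \theta < \bar{x} + a\left\{\frac{\bar{x}(1 - \bar{x})}{n}\right\}^{1/2} \tag{10.3.10}$$

and the lower and upper limits $\theta_1(\bar{x})$ and $\theta_2(\bar{x})$ given in (10.3.6) differ from
those given above by $o(1/\sqrt{n})$.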

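Let $T \sim AN(\theta, \sigma_T^2(\theta)/n)$ and $\psi$ be such that $\frac{d\psi}{d\theta} \neq 0$ and continuous. Then
since $\psi$ is continuous with non-vanishing derivative, $\psi$ is strictly increasing
or is strictly decreasing. Under this condition $\{T_1(x) < \theta < T_2(x)\}$ is equivalent
to $\{\psi(T_1(x)) < \psi(\theta) < \psi(T_2(x))\}$ if $\frac{d\psi}{d\theta} > 0$, and if $\frac{d\psi}{d\theta} < 0$ it is equivalent
to $\{\psi(T_2(x)) < \psi(\theta) < \psi(T_1(x))\}$. Hence for $\psi$ with $\frac{d\psi}{d\theta} > 0$, $(\psi(T_1(x)), \psi(T_2(x)))$
is ACI of size $1 - \gamma$ for $\psi(\theta)$. If $\frac{d\psi}{d\theta} < 0$, then $(\psi(T_2(x)), \psi(T_1(x)))$ would
provide an ACI for $\psi(\theta)$ of size $1 - \gamma$. On the other hand using Theorem
6.1.1 we have: if $T \sim AN\left(\theta, \frac{\sigma_T^2(\theta)}{n}\right)$ then $\psi(T) \sim AN\left(\psi(\theta), \left(\frac{d\psi}{d\theta}\right)^2\frac{\sigma_T^2(\theta)}{n}\right)$.
After studentizing, the ACI for $\psi(\theta)$ of size $1 - \gamma$ is given by

$$\psi(T) \pm \xi_{1-\gamma/2}\left\{\frac{\sigma_T^2(\hat\theta)}{n}\left(\frac{d\psi}{d\theta}\right)^2_{\theta=T}\right\}^{1/2}.$$

Generally the two ACI's for $\psi(\theta)$ differ by
$o(1/\sqrt{n})$. We illustrate this by way of an example.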

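Example 10.3.4 (Contd.) We have already seen that $\bar{x} \pm a(\bar{x}/n)^{1/2}$ and
$(\lambda_1(\bar{x}), \lambda_2(\bar{x}))$ given by (10.3.5) are both ACI of size $1 - \gamma$ for $\lambda$. Consider
$\psi(\lambda) = e^{-\lambda}$. Then $\frac{d\psi}{d\lambda} = -e^{-\lambda} < 0$ and $\psi$ is decreasing, so that $\left(e^{-\lambda_2(\bar{x})}, e^{-\lambda_1(\bar{x})}\right)$
as well as

$$\left(\exp\left\{-\bar{x} - a\left(\frac{\bar{x}}{n}\right)^{1/2}\right\}, \exp\left\{-\bar{x} + a\left(\frac{\bar{x}}{n}\right)^{1/2}\right\}\right)$$

would be ACI of size $1 - \gamma$ for $e^{-\lambda}$. Now consider the CAN estimator $e^{-\bar{x}}$ for $e^{-\lambda}$ with $AV(e^{-\bar{x}})
= \frac{\lambda}{n}e^{-2\lambda}$. A consistent estimator of $AV(e^{-\bar{x}})$ is $\frac{\bar{x}}{n}e^{-2\bar{x}}$. Therefore ACI for $e^{-\lambda}$
is now given by

$$e^{-\bar{x}} \pm a\left(\frac{\bar{x}}{n}\right)^{1/2}e^{-\bar{x}} = e^{-\bar{x}}\left\{1 \pm a\left(\frac{\bar{x}}{n}\right)^{1/2}\right\}.$$

Now consider

$$\exp\left\{-\bar{x} \pm a\left(\frac{\bar{x}}{n}\right)^{1/2}\right\} = e^{-\bar{x}}\exp\left\{\pm a\left(\frac{\bar{x}}{n}\right)^{1/2}\right\} = e^{-\bar{x}}\left\{1 \pm a\left(\frac{\bar{x}}{n}\right)^{1/2}\right\} + o(1/\sqrt{n}).$$

Therefore the ACI of size $1 - \gamma$ given by all the three methods discussed here
differ from each other by $o(1/\sqrt{n})$.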

There is yet one more method by which we can obtain ACI based on 
variance stabilization transformations due to Bartlett (1937) discussed in 
Exercise 6.2 (3). We illustrate it by the Poisson example considered above 


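Example 10.3.4 (Contd.) For the Poisson model the variance stabilizing
transformation is the square root transformation given by $\varphi(\bar{x}) = \{\bar{x}\}^{1/2}$
leading to $\{\bar{x}\}^{1/2} \sim AN\left(\sqrt{\lambda}, \frac{1}{4n}\right)$. Therefore ACI for $\sqrt{\lambda} = \theta$ is given by
$\{\bar{x}\}^{1/2} \pm \frac{a}{2\sqrt{n}}$. To obtain ACI for $\lambda = \theta^2$ we observe that $\frac{d\psi}{d\theta} = 2\theta > 0$ which
leads to ACI given by

$$\left(\{\bar{x}\}^{1/2} \pm \frac{a}{2\sqrt{n}}\right)^2 = \bar{x} \pm a\sqrt{\bar{x}/n} + O\left(\frac{1}{n}\right).$$

Similarly for $e^{-\lambda} = e^{-\theta^2} = \psi(\theta)$ we have $\frac{d\psi}{d\theta} = e^{-\theta^2}(-2\theta) < 0$ so that the ACI for
$e^{-\lambda} = e^{-\theta^2}$ is now given by

$$\left(\exp\left\{-\left(\{\bar{x}\}^{1/2} + \frac{a}{2\sqrt{n}}\right)^2\right\}, \exp\left\{-\left(\{\bar{x}\}^{1/2} - \frac{a}{2\sqrt{n}}\right)^2\right\}\right)$$

which is, up to $o(1/\sqrt{n})$, the same as the ACIs obtained earlier.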
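To see how close the competing intervals for $e^{-\lambda}$ are in practice, the sketch below evaluates three of them for given $\bar{x}$ and $n$: the monotone transform of the studentized interval for $\lambda$, the studentized interval built directly on $e^{-\bar{x}}$, and the interval from the square root transformation. The function name is illustrative, and truncating $\sqrt{\bar{x}} - a/(2\sqrt{n})$ at zero is an added safeguard that the text does not discuss.

```python
from math import exp, sqrt
from statistics import NormalDist


def acis_for_exp_minus_lambda(xbar, n, gamma=0.05):
    """Three ACIs of size 1 - gamma for exp(-lambda) in the Poisson model."""
    a = NormalDist().inv_cdf(1 - gamma / 2)
    h = a * sqrt(xbar / n)

    # 1. Decreasing transform of xbar -/+ a*sqrt(xbar/n)
    monotone = (exp(-xbar - h), exp(-xbar + h))

    # 2. Studentized pivot based on the CAN estimator exp(-xbar)
    delta = (exp(-xbar) * (1 - h), exp(-xbar) * (1 + h))

    # 3. Variance stabilizing (square root) transformation
    s = a / (2 * sqrt(n))
    lower_root = max(sqrt(xbar) - s, 0.0)  # safeguard, not in the text
    vst = (exp(-(sqrt(xbar) + s) ** 2), exp(-lower_root ** 2))

    return monotone, delta, vst


monotone, delta, vst = acis_for_exp_minus_lambda(xbar=3.0, n=400)
```

For $\bar{x} = 3$ and $n = 400$ the three intervals agree to within about $10^{-3}$ at each end, while their half-widths are close to $10^{-2}$.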

In most of the above work we have used the MLE as the CAN estimator 
to construct the asymptotic pivot or its studentized version to obtain ACI. 
Recall that by Hodges-LeCam method we can improve the MLE at a specific 
point. Section 7.7 discussed the N(9, 1) case in detail. We also noted there 
the improvement of MLE at one point in the parameter space is obtained at 
the cost of higher MSE at infinitely many points in the parameter space. We 
now show that the ACI obtained by using the Hodges-LeCam superefficient
estimator could have very poor coverage probability for many points in the 
parameter space. 

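For any $\varepsilon > 0$ consider the coverage probability

$$P(\bar{x}, \varepsilon, \theta) = P[\,|\bar{x} - \theta| < \varepsilon \mid \theta\,] = \Phi(\sqrt{n}\,\varepsilon) - \Phi(-\sqrt{n}\,\varepsilon).$$

Observe that for any $\theta_n$ and any sequence $\varepsilon_n \to 0$ such that $\sqrt{n}\,\varepsilon_n \to \infty$ we
have $P(\bar{x}, \varepsilon_n, \theta_n) \to 1$ as $n \to \infty$. Recall that the superefficient estimator is
given by

$$T = \bar{x} \ \text{ if } |\bar{x}| > a_n, \qquad T = \alpha\bar{x} \ \text{ if } |\bar{x}| \le a_n,$$

where $0 < \alpha < 1$, $a_n \to 0$ and $\sqrt{n}\,a_n \to \infty$. In Section 7.7 we have obtained
$G_n(t, \theta)$, the exact d.f. of $T$ under any $\theta$. Note that

$$G_n(t, \theta) = \Phi[\sqrt{n}(a_n - \theta)], \qquad \alpha a_n < t < a_n,$$

and $G_n(t, \theta)$ remains constant on the interval $[\alpha a_n, a_n]$. Now consider the
coverage probability $P(T, \varepsilon, \theta) = P[\,|T - \theta| < \varepsilon \mid \theta\,]$ and take $\varepsilon_n = c\,a_n$ and
let $\theta_n$ and $a_n$ be chosen such that $\alpha a_n < \theta_n - \varepsilon_n < \theta_n + \varepsilon_n < a_n$. Then for such
a sequence the coverage probability $P(T, \varepsilon_n, \theta_n) = P[\,|T_n - \theta_n| < \varepsilon_n \mid \theta_n\,] =
0$. Thus whereas the coverage probability $P(\bar{x}, \varepsilon_n, \theta_n) \to 1$ as $n \to \infty$ for any
choice of $\varepsilon_n$ and $\theta_n$ as above, there exist sequences $\varepsilon_n$ and $\theta_n$ such that the coverage
probability $P(T, \varepsilon_n, \theta_n) \to 0$ as $n \to \infty$. For example, we take $a_n = n^{-1/4}$ and

$$\theta_n = \frac{(1 + \alpha)\,n^{-1/4}}{2}, \qquad \varepsilon_n = \frac{(1 - \alpha)\,n^{-1/4}}{4}.$$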

For more details see Kale (1985). 
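The contrast in coverage can be illustrated by a small simulation in the $N(\theta, 1)$ case. The sketch below draws $\bar{x} \sim N(\theta_n, 1/n)$ directly and counts how often $\bar{x}$ and the superefficient estimator $T$ fall within $\varepsilon_n$ of $\theta_n$. It takes $a_n = n^{-1/4}$ and $\theta_n = (1+\alpha)a_n/2$; the divisor 4 in $\varepsilon_n$, the value $\alpha = 0.5$, the seed and the function names are assumptions made for the illustration.

```python
import random
from math import sqrt


def hodges(xbar, a_n, alpha):
    """Superefficient estimator: shrink xbar towards 0 when |xbar| <= a_n."""
    return xbar if abs(xbar) > a_n else alpha * xbar


def coverage(n, alpha=0.5, reps=20000, seed=1):
    """Monte Carlo coverage of xbar +/- eps_n and T +/- eps_n at theta_n."""
    rng = random.Random(seed)
    a_n = n ** -0.25
    theta_n = (1 + alpha) * a_n / 2
    eps_n = (1 - alpha) * a_n / 4  # assumed divisor; keeps the interval inside (alpha*a_n, a_n)
    hits_mean = hits_T = 0
    for _ in range(reps):
        xbar = rng.gauss(theta_n, 1 / sqrt(n))  # sample mean under theta_n
        hits_mean += abs(xbar - theta_n) < eps_n
        hits_T += abs(hodges(xbar, a_n, alpha) - theta_n) < eps_n
    return hits_mean / reps, hits_T / reps


cov_mean, cov_T = coverage(n=10000)
```

Since $T$ never takes a value strictly between $\alpha a_n$ and $a_n$, the second proportion is exactly zero for every $n$, whereas the first approaches one as $n$ grows because $\sqrt{n}\,\varepsilon_n \to \infty$.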


In the next section we discuss the problem of confidence intervals from
the view point of coverage probability.

Exercise 10.4 (1) Let $(X_1, \ldots, X_n)$ be a random sample of size $n$ from an exponential
distribution with mean $\theta$. Obtain ACI for $e^{-t_0/\theta}$ by using the studentized
version of the asymptotic pivot based on the CAN estimator of $\psi(\theta) = e^{-t_0/\theta}$. Obtain ACI for $e^{-t_0/\theta}$
using its monotone nature and the ACIs for $\theta$ given by (10.3.3) and (10.3.8). Further
verify that the variance stabilization transformation is $\varphi(\bar{x}) = \log \bar{x}$. Using this obtain ACI's
for $\theta$ and $e^{-t_0/\theta}$. Show that the ACIs obtained in these different ways differ from each
other by $o(1/\sqrt{n})$.

(2) Let $(X_1, \ldots, X_n)$ be a random sample from $N(\mu, \sigma^2)$, $\mu \in R_1$, $\sigma^2 > 0$. An asymptotic
pivot for $\mu + \xi_p\sigma$ can be based on $\bar{x} + \xi_p\hat\sigma$ with its variance estimated. Using this obtain
ACI of size $1 - \gamma$ for $\mu + \xi_p\sigma$.

Let $x_{(r)}$ be CAN for $\mu + \xi_p\sigma$ where $r = [np] + 1$. Obtain ACI based on $x_{(r)}$ and its
estimated asymptotic variance, using the median for estimating $\mu$ and the interquartile range
for estimating $\sigma$. For expressions for variance of linear functions of order statistics for
large samples from $N(\mu, \sigma^2)$ refer to David (1981).

(3) Let $(X_1, \ldots, X_n)$ be a random sample from $U(\mu - 1, \mu + 1)$. Obtain ACI for $\mu$ based
on $\bar{X}$, $M_n$ the median, $X_{(1)} + 1$, $X_{(n)} - 1$, and $(X_{(1)} + X_{(n)})/2$.

(4) Let $(X_1, \ldots, X_n)$ be a random sample from $U(0, \theta)$. We have already seen that

$$u(x_{(n)}, \theta) = \frac{n(\theta - x_{(n)})}{\theta} \xrightarrow{d} Z,$$

where $Z$ is standard exponential. Obtain ACI for $\theta$ using the
asymptotic pivot $u(x_{(n)}, \theta)$ and compare this ACI with SELCI for $\theta$ obtained in Example
10.2.1.


10.4 Unbiased Confidence Intervals 

Neyman (1937) introduced a criterion other than the length of the CI based
on probability of covering a wrong value of $\theta$. Let $(T_1(x), T_2(x)) = E$ be a CI
for $\theta$ of size $1 - \gamma$ and consider $C_E(\theta, \theta') = P[T_1(x) < \theta' < T_2(x) \mid \theta]$, i.e. the
probability that $(T_1(x), T_2(x))$ covers $\theta' \neq \theta$, where $\theta$ is the underlying true
value. We call $(T_1(x), T_2(x))$ an unbiased CI of size $1 - \gamma$ if

$$C_E(\theta, \theta') = 1 - \gamma \ \text{ for } \theta = \theta', \qquad C_E(\theta, \theta') < 1 - \gamma \ \text{ for } \theta \neq \theta' \tag{10.4.1}$$

Thus $(T_1(x), T_2(x))$ is an unbiased CI if the probability of coverage of the true value
$\theta$ is always larger than the probability of coverage of a wrong value $\theta' \neq \theta$,
the true value of the parameter. This is analogous to unbiasedness of a test
of level $\alpha$ which requires that the probability of rejecting $H_0$ when $H_1$ holds
should always be larger than the probability of rejecting $H_0$ when $H_0$ holds
(i.e. power > level). We give two examples, one in which SELCI is unbiased
and the other in which it is not.


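Example 10.4.1 Let $(X_1, \ldots, X_n)$ be a random sample of size $n$ from
$N(\theta, 1)$, then within the class of CI of type $(\bar{x} + a, \bar{x} + b)$ we have seen that
SELCI of size $1 - \gamma$ is given by

$$E = \left(\bar{x} - \frac{a}{\sqrt{n}}, \ \bar{x} + \frac{a}{\sqrt{n}}\right), \quad \text{where } a = \xi_{1-\gamma/2}.$$

Now $C_E(\theta, \theta') = P\left[\bar{x} - \frac{a}{\sqrt{n}} < \theta' < \bar{x} + \frac{a}{\sqrt{n}} \,\Big|\, \theta\right]$. By routine calculations
we have

$$C_E(\theta, \theta') = \Phi(\delta + a) - \Phi(\delta - a) = C_E(\delta) \tag{10.4.2}$$

where $\delta = \sqrt{n}(\theta' - \theta)$.

Note that $C_E(0) = 1 - \gamma$ and as $\delta \to \pm\infty$, $C_E(\delta) \to 0$. Now

$$\frac{dC_E(\delta)}{d\delta} = \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{(a^2 + \delta^2)}{2}\right\}\left[e^{-\delta a} - e^{\delta a}\right]$$

and as $a > 0$,

$$\frac{dC_E(\delta)}{d\delta} > 0 \ \text{ if } \delta < 0, \qquad = 0 \ \text{ if } \delta = 0, \qquad < 0 \ \text{ if } \delta > 0.$$

This implies that $C_E(\delta)$ increases over $(-\infty, 0)$, reaches the maximum value
$1 - \gamma$ at $\delta = 0$ and then decreases over $(0, \infty)$. Therefore $C_E(\delta) < 1 - \gamma$ for
$\delta \neq 0$ and $E$ is an unbiased CI of size $1 - \gamma$.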
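The behaviour of $C_E(\delta)$ in (10.4.2) is easy to tabulate. The following sketch evaluates $\Phi(\delta + a) - \Phi(\delta - a)$ with $a = \xi_{1-\gamma/2}$; the function name is an illustrative choice.

```python
from statistics import NormalDist


def coverage_of_wrong_value(delta, gamma=0.05):
    """C_E(delta) = Phi(delta + a) - Phi(delta - a), with a = xi_{1-gamma/2}."""
    std_normal = NormalDist()
    a = std_normal.inv_cdf(1 - gamma / 2)
    return std_normal.cdf(delta + a) - std_normal.cdf(delta - a)


# delta = sqrt(n) * (theta' - theta); delta = 0 is the true value.
table = {d: coverage_of_wrong_value(d) for d in (-3, -1, 0, 1, 3)}
```

The tabulated values are symmetric in $\delta$, equal $1 - \gamma$ at $\delta = 0$ and fall away on either side, as the sign of the derivative indicates.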

It is left to the reader to verify that a CI of size $1 - \gamma$ of the type $(\bar{x} + c,
\bar{x} + d)$ with $c < d$ is unbiased only when $-c = d = \xi_{1-\gamma/2}/\sqrt{n}$ (Guenther, 1971).


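Example 10.4.2 Consider $(X_1, \ldots, X_n)$ a random sample of size $n$ from the
exponential distribution with mean $\theta$. Let $E = \left(\frac{\sum x_i}{b}, \frac{\sum x_i}{a}\right)$ be a CI for $\theta$ of
size $1 - \gamma$. Then, as seen before, for $E$ to be SELCI, $a$ and $b$ are given by

$$G_n(b) - G_n(a) = 1 - \gamma \quad \text{and} \quad a^{n+1}e^{-a} = b^{n+1}e^{-b} \tag{10.4.3}$$

Now consider

$$C_E(\theta, \theta') = P\left[\frac{\sum x_i}{b} < \theta' < \frac{\sum x_i}{a} \,\Big|\, \theta\right]
= P\left[\frac{a\theta'}{\theta} < \frac{\sum x_i}{\theta} < \frac{b\theta'}{\theta} \,\Big|\, \theta\right]
= G_n\left(b\frac{\theta'}{\theta}\right) - G_n\left(a\frac{\theta'}{\theta}\right).$$

Then letting $\frac{\theta'}{\theta} = \lambda$ we have at $\lambda = 1$, $C_E(\theta, \theta') = 1 - \gamma$. Consider $G_n(b\lambda) -
G_n(a\lambda) = \psi(\lambda, a, b)$. Then $\psi(\lambda, a, b) \to 0$ as $\lambda \to 0$ and $\lambda \to \infty$. Now the CI $E$ will
be unbiased if $\frac{\partial\psi}{\partial\lambda} = g_n(b\lambda)\,b - g_n(a\lambda)\,a = 0$ at $\lambda = 1$, i.e. $a^ne^{-a} = b^ne^{-b}$. Having
selected $a$ and $b$ to satisfy $a^ne^{-a} = b^ne^{-b}$, we show that $E$ is an unbiased CI of
size $1 - \gamma$. This can be proved if we show that $\frac{\partial\psi(\lambda, a, b)}{\partial\lambda} = 0$ at $\lambda = 1$ and
$\left[\frac{\partial^2\psi}{\partial\lambda^2}\right]_{\lambda=1} < 0$.

Now

$$\left[\frac{\partial^2\psi}{\partial\lambda^2}\right]_{\lambda=1} = \left[g_n'(b\lambda)\,b^2 - g_n'(a\lambda)\,a^2\right]_{\lambda=1} = g_n'(b)\,b^2 - g_n'(a)\,a^2.$$

For $n \ge 2$,

$$g_n'(u) = \frac{1}{\Gamma(n)}e^{-u}(n - 1)u^{n-2} - \frac{1}{\Gamma(n)}e^{-u}u^{n-1}.$$

Thus

$$g_n'(b)\,b^2 - g_n'(a)\,a^2 = \frac{1}{\Gamma(n)}\left\{e^{-b}(n - 1)b^n - e^{-b}b^{n+1} - e^{-a}(n - 1)a^n + e^{-a}a^{n+1}\right\}.$$

But $e^{-b}b^n = e^{-a}a^n$ and therefore

$$g_n'(b)\,b^2 - g_n'(a)\,a^2 = \frac{1}{\Gamma(n)}\left\{e^{-a}a^n a - e^{-b}b^n b\right\} \tag{10.4.4}$$

Thus LHS of (10.4.4) $= \frac{1}{\Gamma(n)}e^{-a}a^n(a - b) < 0$ as $a < b$.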


For n = 
e~° - e~ h = 


show that 



We note that SELCI of size $1 - \gamma$ for $\theta$ has $a$ and $b$ defined by $G_n(b) -
G_n(a) = 1 - \gamma$ and $e^{-b}b^{n+1} = e^{-a}a^{n+1}$, whereas for the unbiased CI of the same
size $1 - \gamma$ we must have $a$ and $b$ such that $G_n(b) - G_n(a) = 1 - \gamma$ and
$e^{-b}b^n = e^{-a}a^n$, and so the SELCI is not an unbiased CI.
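The two pairs $(a, b)$ can be found numerically. The sketch below is one possible approach, using nested bisection and the closed form of $G_n$ for integer $n$; the function names, the bracketing constants and the choice $n = 10$ are assumptions made for the illustration. Passing `m = n` gives the unbiased CI and `m = n + 1` gives the SELCI.

```python
from math import exp, log, factorial


def erlang_cdf(x, n):
    """G_n(x): d.f. of the gamma distribution with integer shape n and scale 1."""
    return 1.0 - exp(-x) * sum(x ** k / factorial(k) for k in range(n))


def bisect(f, lo, hi, iters=100):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    lo_positive = f(lo) > 0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (f(mid) > 0) == lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def pivot_limits(n, m, gamma=0.05):
    """a < b with G_n(b) - G_n(a) = 1 - gamma and a^m e^{-a} = b^m e^{-b}."""
    def h(u):
        return m * log(u) - u  # log of u^m e^{-u}, maximal at u = m

    def partner(a):
        # the b > m that gives the same value of u^m e^{-u} as a < m
        return bisect(lambda b: h(b) - h(a), m, m + 50 * (m + 1))

    def excess_cover(a):
        return erlang_cdf(partner(a), n) - erlang_cdf(a, n) - (1 - gamma)

    a = bisect(excess_cover, 1e-9, m * (1 - 1e-4))
    return a, partner(a)


n = 10
a_unb, b_unb = pivot_limits(n, n)       # unbiased CI
a_sel, b_sel = pivot_limits(n, n + 1)   # shortest expected length CI
```

Given $\sum x_i$, the two intervals are $(\sum x_i/b, \sum x_i/a)$ with the respective pairs; the pairs differ, which is the numerical counterpart of the remark that the SELCI is not unbiased.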

Neyman (1937) then suggested obtaining an optimum CI which is “best”
within the class of unbiased CI of size $1 - \gamma$, i.e. for which $C_E(\theta, \theta') =
1 - \gamma$ for $\theta = \theta'$ and $C_E(\theta, \theta') < 1 - \gamma$ and is uniformly as small as possible
for $\theta \neq \theta'$.


The Neyman criterion of unbiasedness is more appropriate in the case of one
sided CIs of the type $(T_1(x), \infty)$ or $(-\infty, T_2(x))$ where the length of the CI is
infinite. However the wrong values of $\theta$ are to be defined as $\theta' < \theta$ in the case of a
CI bounded below by $T_1(x)$ and $\theta' > \theta$ in the case of a CI bounded above by $T_2(x)$.
We illustrate this by an example.


Example 10.4.2 (Contd.). Suppose we are interested in a CI of the type
$(T_1(x), \infty)$ for $\theta$ which is natural as $\theta$ represents in this case average life of
an item under consideration. Then we select the same pivot and the interval

$$E = \left(\frac{\sum x_i}{a}, \ \infty\right) \quad \text{where } P\left[\frac{\sum x_i}{a} < \theta \,\Big|\, \theta\right] = 1 - \gamma, \ \text{ or } a \text{ such that } G_n(a) = 1 - \gamma.$$

Here the probability of $E$ covering a wrong value $\theta' \neq \theta$ is given by

$$C_E(\theta, \theta') = G_n\left(a\frac{\theta'}{\theta}\right), \qquad = 1 - \gamma \ \text{ if } \theta = \theta', \qquad < 1 - \gamma \ \text{ if } \theta' < \theta.$$

Note that $C_E(\theta, \theta') > 1 - \gamma$ if $\theta' > \theta$ but this is not objectionable since
$\frac{\sum x_i}{a} < \theta \Rightarrow \frac{\sum x_i}{a} < \theta'$ if $\theta' > \theta$. Thus wrong values here are $\theta' < \theta$, and
covering of any such value is less desirable.

Neyman’s formulation of the CI problem is similar to his formulation of
the problem of tests of hypotheses where the Type I error, namely rejecting $H_0$
when $H_0$ is true, was bounded above by $\alpha$. This is equivalent to the probability
of accepting $H_0$ when $H_0$ is true being bounded below by $1 - \alpha$. Thus rather
than looking at the critical region (CR) or test function $\varphi$ for $H_0$, in the CI
problem we look at the complement of CR, the acceptance region or the
acceptance function $1 - \varphi$. The reader would note that the complement of
the CI of level $1 - \gamma$ for $\theta$ provides a critical region for testing $H_0: \theta' = \theta$
vs $H_1: \theta' \neq \theta$ in the two sided case. A similar remark holds for $H_0: \theta' \le \theta$ vs
$H_1: \theta' > \theta$ in the case of the one sided CI $(T_1(x), \infty)$.

For any $\theta_0 \in \Omega$ suppose there exists a non-randomized test of level $\gamma$
based on a statistic $s(x)$ which rejects $H_0: \theta = \theta_0$ against $H_1: \theta \neq \theta_0$ when
$K_2(\theta_0) < s(x)$ or $s(x) < K_1(\theta_0)$ for each $\theta_0 \in \Omega$. Then we have

$$P[K_1(\theta_0) < s(x) < K_2(\theta_0) \mid \theta_0] = 1 - \gamma, \quad \forall\, \theta_0 \in \Omega. \tag{10.4.5}$$

Suppose $K_1(\theta_0)$ and $K_2(\theta_0)$ as functions of $\theta_0 \in \Omega$ are strictly increasing;
then (10.4.5) is equivalent to

$$P[K_2^{-1}(s(x)) < \theta_0 < K_1^{-1}(s(x)) \mid \theta_0] = 1 - \gamma \quad \forall\, \theta_0 \in \Omega \tag{10.4.6}$$

and $(K_2^{-1}(s(x)), K_1^{-1}(s(x)))$ is a CI for $\theta$ of size $1 - \gamma$. If $K_1(\theta_0)$ and $K_2(\theta_0)$
as functions of $\theta_0 \in \Omega$ are strictly decreasing then $(K_1^{-1}(s(x)), K_2^{-1}(s(x)))$ is
a CI for $\theta$ of size $1 - \gamma$.

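Example 10.4.1 (Contd.) For the $N(\theta, 1)$ case for testing $H_0: \theta = \theta_0$ vs $H_1: \theta \neq \theta_0$ we
have $s(x) = \bar{x}$ and $K_1(\theta_0) = \theta_0 - \frac{\xi_{1-\gamma/2}}{\sqrt{n}}$, $K_2(\theta_0) = \theta_0 + \frac{\xi_{1-\gamma/2}}{\sqrt{n}}$. Both $K_1(\theta_0)$
and $K_2(\theta_0)$ are strictly increasing and $K_1^{-1}(\bar{x}) = \bar{x} + \frac{\xi_{1-\gamma/2}}{\sqrt{n}}$,
$K_2^{-1}(\bar{x}) = \bar{x} - \frac{\xi_{1-\gamma/2}}{\sqrt{n}}$, so that

$$\left(\bar{x} - \frac{\xi_{1-\gamma/2}}{\sqrt{n}}, \ \bar{x} + \frac{\xi_{1-\gamma/2}}{\sqrt{n}}\right)$$

is a CI for $\theta$ of size $1 - \gamma$.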

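Example 10.4.2 (Contd.). Consider the one sided problem $H_0: \theta \le \theta_0$ vs $H_1: \theta
> \theta_0$; then we have $s(x) = \sum x_i$ and we reject $H_0$ if $\sum x_i > K_\gamma(\theta_0)$ where $K_\gamma(\theta_0)
= \theta_0\,G_n^{-1}(1 - \gamma)$. Now $K_\gamma(\theta_0)$ is strictly increasing and therefore $K_\gamma^{-1}(\sum x_i)
= \frac{\sum x_i}{G_n^{-1}(1 - \gamma)}$ and we have $P[K_\gamma^{-1}(\sum x_i) < \theta_0 \mid \theta_0] = 1 - \gamma$, $\forall\, \theta_0 \in \Omega$, so that

$$\left(\frac{\sum x_i}{G_n^{-1}(1 - \gamma)}, \ \infty\right)$$

is a CI of size $1 - \gamma$ for $\theta$, the mean of the exponential
distribution.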
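The lower confidence limit $\sum x_i / G_n^{-1}(1 - \gamma)$ only needs a quantile of the gamma distribution with integer shape $n$. A minimal sketch, using the closed form of $G_n$ and bisection for its inverse, is given below; the function names and the failure-time data are hypothetical.

```python
from math import exp, factorial


def erlang_cdf(x, n):
    """G_n(x): d.f. of the gamma distribution with integer shape n and scale 1."""
    return 1.0 - exp(-x) * sum(x ** k / factorial(k) for k in range(n))


def erlang_quantile(p, n):
    """G_n^{-1}(p) for 0 < p < 1, found by bisection."""
    lo, hi = 0.0, 1.0
    while erlang_cdf(hi, n) < p:  # expand until the quantile is bracketed
        hi *= 2
    for _ in range(100):
        mid = (lo + hi) / 2
        if erlang_cdf(mid, n) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def lower_confidence_bound(xs, gamma=0.05):
    """Lower end of the CI (sum(x) / G_n^{-1}(1 - gamma), infinity) for theta."""
    return sum(xs) / erlang_quantile(1 - gamma, len(xs))


# Hypothetical failure times
lifetimes = [1.2, 0.4, 2.7, 0.9, 3.1, 1.8, 0.6, 2.2, 1.1, 1.5]
bound = lower_confidence_bound(lifetimes)
```

Because $2\sum x_i/\theta$ has a chi-square distribution with $2n$ degrees of freedom, $G_n^{-1}(1 - \gamma)$ equals half the corresponding chi-square quantile, which gives an independent check on the bisection.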

For a detailed study of the connection between the problem of CI and that of
testing of hypotheses, refer to Lehmann (1959). We emphasize that the view in this
chapter is that the limits of the CI are random and these cover the true value $\theta$ of
the parameter with probability $1 - \gamma$ specified in advance. Historically the
Bayesian approach, briefly outlined at the end of Chapter 9, viewed the problem
of confidence intervals as expressing uncertainty about $\theta$ after obtaining
data using the posterior probability of $\theta \in (a, b)$ given by $\int_a^b p(\theta \mid x)\,d\theta$. On the
other hand Fisher was not in favour of using the Bayesian approach based on the
assumption of a known prior probability distribution, particularly when the
prior was either subjective or chosen with mathematical convenience in mind.
Fisher therefore proposed a solution where uncertainty about $\theta$ is expressed
by a fiducial probability statement. In the next section we will briefly consider the
Bayesian approach and Fisher’s approach to the problem of CI.


10.5 Bayesian and Fiducial Intervals 

Let $(x_1, \ldots, x_n)$ be a random sample of size $n$ from $\{f(x, \theta), \theta \in \Omega \subset R_1\}$.
Following the Bayesian approach, the uncertainty about $\theta$ is expressed by a
prior probability distribution given by pdf $p(\theta)$ defined on $\Omega$. Then the joint
pdf of $(X, \theta)$ is given by $L(x, \theta)\,p(\theta)$ and the conditional pdf of $\theta$ for fixed
$x$, the posterior distribution of $\theta$, is given by

$$p(\theta \mid x) = \frac{L(x, \theta)\,p(\theta)}{\int_\Omega L(x, \theta)\,p(\theta)\,d\theta} \tag{10.5.1}$$

For any interval $(a, b) \subset \Omega$ we have then

$$P(a < \theta < b) = \int_a^b p(\theta \mid x)\,d\theta \tag{10.5.2}$$

The RHS of (10.5.2) is a function of the observations $x$. For example as seen
at the end of Chapter 9, in the model $\{b(1, \theta), 0 < \theta < 1\}$ with uniform prior
$p(\theta) = 1$, for $0 < \theta < 1$, for the data in which $t$ successes are observed in $n$
trials we have

$$P(a < \theta < b) = \int_a^b \theta^t(1 - \theta)^{n-t}\,d\theta \Big/ B(t + 1, n - t + 1) \tag{10.5.3}$$

Observe that the RHS of (10.5.3) changes as $t$ changes. With uniform prior over
$\Omega = (0, 1)$ the posterior distribution of $\theta$ given $t$ is a beta distribution with
parameters $t + 1$ and $n - t + 1$. Note that the posterior distribution of $\theta$ given $x$
depends on $x$ only through $t = \sum x_i$ which is the minimal sufficient statistic for
$\theta$ in the model specified by $\{b(1, \theta), \theta \in (0, 1)\}$. In the joint pdf of observations
the data $(x_1, \ldots, x_n)$ are variables and $\theta$ is fixed whereas in the posterior
distribution $\theta$ is a variable and $x$ is fixed and $t$ becomes the parameter. Indeed
if $T$ is a sufficient statistic for the model $\{L(x, \theta), \theta \in \Omega\}$ and $p(\theta)$ is any
prior over $\Omega$, it is easy to see that the posterior distribution of $\theta$ depends on
$x$ only through $T(x) = t$. To prove this we observe that

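$$p(\theta \mid x) = L(x, \theta)\,p(\theta) \Big/ \int_\Omega L(x, \theta)\,p(\theta)\,d\theta$$

$$= g(T(x), \theta)\,h(x)\,p(\theta) \Big/ \int_\Omega g(T(x), \theta)\,h(x)\,p(\theta)\,d\theta$$

$$= g(t, \theta)\,p(\theta) \Big/ \int_\Omega g(t, \theta)\,p(\theta)\,d\theta \tag{10.5.4}$$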

In fact Kolmogorov (1950) defines a statistic $T$ to be sufficient in the
Bayesian framework if for any prior $p(\theta)$ on $\Omega$, the posterior distribution of
$\theta$ given $x$ depends on $x$ only through $T(x)$. One can show that sufficiency as
defined earlier in Ch. 2 is equivalent to Bayesian sufficiency [Raiffa and
Schlaifer (1961)].

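Since $p(\theta \mid x)$ depends on the prior distribution $p(\theta)$, for two different
priors for the same data $x$ we get different values for $P(a < \theta < b)$. For
example if we assume a beta type prior distribution for $\theta$ given by

$$p(\theta) = \theta^{\alpha-1}(1 - \theta)^{\beta-1}/B(\alpha, \beta), \quad 0 < \theta < 1,$$

a beta distribution with $\alpha$, $\beta$ known, then

$$P(a < \theta < b) = \int_a^b \theta^{\alpha+t-1}(1 - \theta)^{\beta+n-t-1}\,d\theta \Big/ B(\alpha + t, \beta + n - t).$$

Here $p(\theta \mid x)$ is the beta distribution with parameters $(\alpha + t, \beta + n - t)$.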
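When $\alpha$ and $\beta$ are positive integers the posterior probability can be computed without special functions, since the d.f. of a beta distribution with integer parameters $p$, $q$ is a binomial tail sum. The sketch below uses that identity; the function names are illustrative, and the restriction to integer prior parameters is an assumption of the sketch, not of the text.

```python
from math import comb


def beta_cdf_int(x, p, q):
    """d.f. at x of the beta distribution with positive integer parameters p, q.

    Uses I_x(p, q) = P(Bin(p + q - 1, x) >= p).
    """
    m = p + q - 1
    return sum(comb(m, j) * x ** j * (1 - x) ** (m - j) for j in range(p, m + 1))


def posterior_prob(a, b, t, n, alpha=1, beta=1):
    """P(a < theta < b | t successes in n trials) under a Beta(alpha, beta) prior."""
    p, q = alpha + t, beta + n - t
    return beta_cdf_int(b, p, q) - beta_cdf_int(a, p, q)


# Same data, two priors: uniform Beta(1, 1) and Beta(2, 5).
prob_uniform = posterior_prob(0.2, 0.6, t=3, n=10)
prob_informative = posterior_prob(0.2, 0.6, t=3, n=10, alpha=2, beta=5)
```

The default `alpha = beta = 1` corresponds to the uniform prior of (10.5.3); changing the prior changes the probability assigned to the same interval for the same data.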

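In a similar way consider the situation where we have a random sample
of size $n$ from $N(\theta, \sigma_0^2)$ and where the prior distribution of $\theta$ is itself normal
with mean $\theta_0$ and variance $\eta_0^2$, where $\theta_0$ and $\eta_0^2$ are assumed to be known.
Then using (10.5.4) and the fact that $\bar{x} \sim N(\theta, \sigma_0^2/n)$ is the sufficient
statistic, by straightforward but somewhat complicated calculations we can
show that the posterior distribution of $\theta$ given $x$ is normal with mean which
is the weighted average of $\theta_0$ and $\bar{x}$ with weights inversely proportional to the
variances $\eta_0^2$ and $\sigma_0^2/n$, respectively, i.e. the posterior mean of $\theta$ is

$$\tilde\theta = \left(\frac{\theta_0}{\eta_0^2} + \frac{\bar{x}}{\sigma_0^2/n}\right) \Big/ \left(\frac{1}{\eta_0^2} + \frac{1}{\sigma_0^2/n}\right)
\quad \text{or} \quad \tilde\theta = \frac{\sigma_0^2\theta_0 + n\eta_0^2\bar{x}}{\sigma_0^2 + n\eta_0^2}.$$

The variance of the posterior distribution of $\theta$ is given by

$$\left\{\frac{1}{\eta_0^2} + \frac{1}{\sigma_0^2/n}\right\}^{-1} = \left\{\frac{\sigma_0^2 + n\eta_0^2}{\sigma_0^2\eta_0^2}\right\}^{-1} = \frac{\sigma_0^2\eta_0^2}{\sigma_0^2 + n\eta_0^2}.$$

For a more detailed derivation we refer to Appendix A 1.1 of Box and Tiao
(1973). Therefore under the above set up

$$P(a < \theta < b) = \Phi\left(\frac{(b - \tilde\theta)(\sigma_0^2 + n\eta_0^2)^{1/2}}{\sigma_0\eta_0}\right) - \Phi\left(\frac{(a - \tilde\theta)(\sigma_0^2 + n\eta_0^2)^{1/2}}{\sigma_0\eta_0}\right).$$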

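As an exercise, the readers should work out the case of a sample of size
$n$ from the exponential distribution with mean $\theta$ with the prior distribution of $\theta$
as inverse gamma with pdf $p(\theta) = \frac{1}{\Gamma(\lambda_0)}\exp\left(-\frac{1}{\theta}\right)\left(\frac{1}{\theta}\right)^{\lambda_0+1}$, $\theta > 0$, $\lambda_0$
known.

Note that $\sum X_i = T$ is the minimal sufficient statistic for $\theta$ with pdf given
by

$$g(t, \theta) = \frac{1}{\Gamma(n)\,\theta^n}e^{-t/\theta}\,t^{n-1}, \quad t > 0, \ \theta > 0.$$

Since $P(a < \theta < b) = P\left(\frac{1}{b} < \frac{1}{\theta} < \frac{1}{a}\right)$ it is easier to work with the
reparametrization $\phi = \frac{1}{\theta}$ so that the joint distribution of $t$ and $\phi$ is

$$g(t, \phi)\,p(\phi) = \frac{1}{\Gamma(n)\Gamma(\lambda_0)}\exp(-\phi(t + 1))\,\phi^{n+\lambda_0-1}\,t^{n-1}, \quad \phi > 0, \ t > 0.$$

The marginal distribution of $t$ is the beta distribution of the second kind with pdf
$\frac{\Gamma(n + \lambda_0)}{\Gamma(n)\Gamma(\lambda_0)}\frac{t^{n-1}}{(1 + t)^{n+\lambda_0}}$, $0 < t < \infty$. The posterior distribution of $\phi$ given $t$ is

$$p(\phi \mid t) = \frac{1}{\Gamma(n + \lambda_0)}\exp\{-\phi(t + 1)\}\,\phi^{n+\lambda_0-1}(t + 1)^{n+\lambda_0}, \quad 0 < \phi < \infty,$$

which is a gamma pdf. Hence

$$P(a < \theta < b) = G_{n+\lambda_0}\left(\frac{t + 1}{a}\right) - G_{n+\lambda_0}\left(\frac{t + 1}{b}\right).$$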
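For integer $\lambda_0$ the posterior probability $G_{n+\lambda_0}((t+1)/a) - G_{n+\lambda_0}((t+1)/b)$ can be evaluated with the closed form of the gamma d.f. with integer shape. The sketch below does this; the function names and the numerical inputs are illustrative, and the restriction to integer $\lambda_0$ is an assumption of the sketch only.

```python
from math import exp, factorial


def erlang_cdf(x, m):
    """G_m(x): d.f. of the gamma distribution with integer shape m and scale 1."""
    return 1.0 - exp(-x) * sum(x ** k / factorial(k) for k in range(m))


def posterior_prob(a, b, t, n, lam0):
    """P(a < theta < b | t) for the exponential mean theta.

    Prior on 1/theta is gamma with shape lam0 (integer here) and rate 1,
    so 1/theta given t is gamma with shape n + lam0 and rate t + 1.
    """
    m = n + lam0
    return erlang_cdf((t + 1) / a, m) - erlang_cdf((t + 1) / b, m)


prob = posterior_prob(a=0.8, b=2.0, t=12.0, n=10, lam0=2)
```

The value depends on the data only through $t = \sum x_i$, as the argument leading to (10.5.4) requires.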



The above approach has assumed prior distributions of $\theta$ in such a way
that the posterior distribution is easy to work out. Further we have also
assumed that the interval $(a, b)$ is specified and therefore $P(a < \theta < b)$
depends on data $x$, through the sufficient statistic $t$. Thus for different values of
$t$ we will get different evaluations although on any data $x$ where the value of $t$
remains the same, $P(a < \theta < b)$ will remain the same.

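On the other hand one can select $(a, b)$ such that $P(a < \theta < b) = 1 - \gamma$ with
$(b - a)$ as small as possible. Such an interval $(a, b)$ will be generally around
the mode of the posterior distribution of $\theta$ as the posterior density $p(\theta \mid x)$
is thickest in the interval containing the posterior mode as compared to the
other intervals all having the same posterior probability $1 - \gamma$. Such intervals
are called Highest Posterior Density (HPD) intervals. Using the standard
techniques of minimization subject to constraint, in the normal case the HPD
interval of size $1 - \gamma$ is given by

$$\tilde\theta \pm \xi_{1-\gamma/2}\frac{\sigma_0\eta_0}{[\sigma_0^2 + n\eta_0^2]^{1/2}}, \quad \text{where } \tilde\theta = \frac{\sigma_0^2\theta_0 + n\eta_0^2\bar{x}}{\sigma_0^2 + n\eta_0^2}.$$

Note that the above Bayesian interval is close to the SELCI for $\theta$ given by
$\bar{x} \pm \xi_{1-\gamma/2}\,\sigma_0/\sqrt{n}$ for large $n$, since

$$\frac{\sigma_0\eta_0}{(\sigma_0^2 + n\eta_0^2)^{1/2}} = \frac{\sigma_0}{\sqrt{n}} + O\left(\frac{1}{n}\right) \quad \text{and} \quad \tilde\theta = \bar{x} + O(1/n).$$

On the other hand even for small $n$ but large $\eta_0^2$ we have

$$\frac{\sigma_0\eta_0}{(\sigma_0^2 + n\eta_0^2)^{1/2}} = \frac{\sigma_0}{\sqrt{n}}\left\{1 + \frac{\sigma_0^2}{n\eta_0^2}\right\}^{-1/2} = \frac{\sigma_0}{\sqrt{n}} + O\left(\frac{1}{\eta_0^2}\right)$$

and $\tilde\theta = \bar{x} + O\left(\frac{1}{\eta_0^2}\right)$. Therefore for large values of $\eta_0$, the classical SELCI
for $\theta$ of size $1 - \gamma$ agrees with the HPD interval of the same size $1 - \gamma$ up to $O\left(\frac{1}{\eta_0^2}\right)$.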
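The agreement between the HPD interval and the SELCI for a diffuse prior can be checked directly. The sketch below computes the normal posterior, the HPD interval and the SELCI for given inputs; the function names and the numerical values are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def normal_posterior(xbar, n, sigma0_sq, theta0, eta0_sq):
    """Posterior of theta for N(theta, sigma0^2) data and N(theta0, eta0^2) prior."""
    post_mean = (sigma0_sq * theta0 + n * eta0_sq * xbar) / (sigma0_sq + n * eta0_sq)
    post_var = sigma0_sq * eta0_sq / (sigma0_sq + n * eta0_sq)
    return NormalDist(post_mean, sqrt(post_var))


def hpd_and_selci(xbar, n, sigma0_sq, theta0, eta0_sq, gamma=0.05):
    """Return (HPD interval, SELCI), both of size 1 - gamma."""
    xi = NormalDist().inv_cdf(1 - gamma / 2)
    post = normal_posterior(xbar, n, sigma0_sq, theta0, eta0_sq)
    hpd = (post.mean - xi * post.stdev, post.mean + xi * post.stdev)
    half_width = xi * sqrt(sigma0_sq / n)
    selci = (xbar - half_width, xbar + half_width)
    return hpd, selci


# Small n, moderately informative prior versus a very diffuse prior.
hpd_tight, selci = hpd_and_selci(1.5, 5, 4.0, 0.0, 1.0)
hpd_diffuse, _ = hpd_and_selci(1.5, 5, 4.0, 0.0, 1e8)
```

With $\eta_0^2 = 1$ the HPD interval is pulled towards the prior mean and is shorter than the SELCI; with $\eta_0^2 = 10^8$ the two intervals are numerically indistinguishable even though $n = 5$.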
Similarly in the exponential case the HPD interval for $\theta$ would be given
by $(a, b)$ such that $(b - a)$ is minimized subject to the constraint

$$G_{n+\lambda_0}\left(\frac{t + 1}{a}\right) - G_{n+\lambda_0}\left(\frac{t + 1}{b}\right) = 1 - \gamma.$$

Here $(a, b)$ is determined by the size constraint and the equation

$$g_{n+\lambda_0}\left(\frac{t + 1}{a}\right)\frac{(t + 1)}{a^2} = g_{n+\lambda_0}\left(\frac{t + 1}{b}\right)\frac{(t + 1)}{b^2}$$

or

$$e^{-(t+1)/a}\frac{1}{a^{n+\lambda_0+1}} = e^{-(t+1)/b}\frac{1}{b^{n+\lambda_0+1}}.$$

For the observed $t$ we can use Tate and Klett (1959) tables or
other numerical methods to determine $(a, b)$.
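One such numerical method is sketched below. It works on the scale $u = (t+1)/\theta$, where the equal-density condition becomes $u^{m+1}e^{-u}$ taking the same value at both ends with $m = n + \lambda_0$, and solves the two equations by nested bisection. The function names, the bracketing constants and the restriction to integer $\lambda_0$ are assumptions of the sketch.

```python
from math import exp, log, factorial


def erlang_cdf(x, m):
    """G_m(x): d.f. of the gamma distribution with integer shape m and scale 1."""
    return 1.0 - exp(-x) * sum(x ** k / factorial(k) for k in range(m))


def bisect(f, lo, hi, iters=100):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    lo_positive = f(lo) > 0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if (f(mid) > 0) == lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def hpd_interval(t, n, lam0, gamma=0.05):
    """HPD interval (a, b) for the exponential mean theta, integer lam0."""
    m = n + lam0
    c = t + 1.0
    peak = m + 1  # u-value of the posterior mode of theta, i.e. theta = c / (m + 1)

    def h(u):
        # log posterior density of theta at theta = c/u, up to a constant
        return (m + 1) * log(u) - u

    def partner(u):
        # the u-value above the peak with the same density as u below it
        return bisect(lambda v: h(v) - h(u), peak, 60 * peak)

    def excess_prob(u):
        return erlang_cdf(partner(u), m) - erlang_cdf(u, m) - (1 - gamma)

    u_low = bisect(excess_prob, 1e-9, peak * (1 - 1e-4))
    u_high = partner(u_low)
    return c / u_high, c / u_low


a, b = hpd_interval(t=12.0, n=10, lam0=2)
```

The returned limits satisfy both the size constraint and the equal-density equation, and the interval contains the posterior mode $(t+1)/(n+\lambda_0+1)$.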

The serious objection to Bayesian inference based on inverse
probability is its heavy dependence on the assumed prior distribution
of the parameter $\theta$. This prior distribution expresses uncertainty about $\theta \in \Omega$.
The underlying chance behaviour of data $x$ is expressed by the model given
by $\{L(x, \theta), \theta \in \Omega\}$. It is thus possible that using the same model but
different prior distributions $p_1(\theta)$ and $p_2(\theta)$, the experimenter or the statistician
arrives at different posterior distributions $p_1(\theta \mid x)$ and $p_2(\theta \mid x)$ for the
same data $x$. This is regarded as loss of objectivity so crucial to scientific
method. On the other hand two experimenters or statisticians based on
two different models $\{L_1(x, \theta), \theta \in \Omega\}$ and $\{L_2(x, \theta), \theta \in \Omega\}$ may recommend
different estimators of the parameter $\theta$ based on the same observed data $x$. This
is not regarded as loss of objectivity or violation of scientific method. This
difference in our attitude towards data $x$ and parameter $\theta$ is perhaps a
consequence of the view that making possibly uncertain statements about the true value of
$\theta$ in a given model is regarded as a legitimate scientific activity.

The method of inverse probability, initiated by Bayes-Laplace, had great
followers such as Gauss, K. Pearson and even Fisher to start with. However
slowly, due to criticism from Boole (1854), Venn (1866) and due to
contradictions that could arise from specification of different priors, Fisher
appears to have turned away from the method of inverse probability as a tool
for inference, at least in those situations in which the prior probability distribution
had no objective basis. Fisher therefore proposed confidence intervals for the
unknown parameter $\theta$ using fiducial probability distributions as an alternative
to using posterior distributions based on a prior $p(\theta)$.

The fiducial intervals as proposed by Fisher are based on pivotal quantities
and can be obtained only in special models where there is an inherent symmetry
between the observable r.v. $x$ and the parameter $\theta$, e.g. in the $N(\theta, 1)$ model or in
general the one parameter exponential family in its canonical form. Thus in
$N(\theta, 1)$ the pdf of $\bar{x}$ is given by

$$f(\bar{x}, \theta) = \left\{\frac{n}{2\pi}\right\}^{1/2}\exp\left\{-\frac{n(\bar{x} - \theta)^2}{2}\right\}, \quad \bar{x} \in R_1, \ \theta \in R_1.$$

Here $\bar{x}$ is a r.v. and $\theta$ is fixed. But $f(\bar{x}, \theta)$ is symmetric in $\bar{x}$ and $\theta$, and one
could interchange their roles or fix $\bar{x}$ and consider $\theta$ as a variable. In the $(\bar{x}, \theta)$
plane the set $-a < \bar{x} - \theta < a$ is such that $P[-a < \bar{x} - \theta < a \mid \theta] = 2\Phi(\sqrt{n}\,a) - 1$
$\forall\, \theta \in R_1$, and $-a < \bar{x} - \theta < a$ is equivalent to $\bar{x} - a < \theta < \bar{x} + a$ or $\theta - a
< \bar{x} < \theta + a$. Fisher suggested that the probability $2\Phi(\sqrt{n}\,a) - 1$ can be
associated with the interval $(\bar{x} - a, \bar{x} + a)$ and, to distinguish it from the
probability associated with $(\theta - a, \theta + a)$ assigned by the distribution of $\bar{x}$,
called it the fiducial probability of $\theta$ and consequently defined the fiducial
distribution of $\theta$ as $N\left(\bar{x}, \frac{1}{n}\right)$ with $\bar{x}$ fixed and viewed as a parameter of the
fiducial distribution. In many situations fiducial intervals and CI for $\theta$ coincide
but the interpretation is quite different. The classical theory due to Neyman
treats $(T_1(x), T_2(x))$ as random and $\theta$ fixed whereas the Fisherian approach treats
$\theta$ as a variable and $(T_1(x), T_2(x))$ as fixed. Fisher did not want to claim that
$\theta$ is a random variable but maintained that the uncertainty about $\theta$ can be
expressed as a probability distribution which he called the fiducial probability
distribution to distinguish it from the classical probability distributions used
in the model $\{L(x, \theta), \theta \in \Omega\}$. In this Fisher (1930) has followed Bayes
(1763) wherein initially the uncertainty about the parameter $\theta$ is called
Chance to distinguish it from the Probability of the observed data. Although
later in the same paper (1763), Bayes stated that by Chance he means the
same thing as Probability, Fisher maintained the distinction between the two.


Savage (1954) described this attempt of Fisher as “trying to make an omelette
without breaking an egg”. The statistical inference based on the fiducial approach
did not get as much following as Bayesian inference. For example whereas
Kendall and Stuart Vol. II (1967) has a chapter (pp. 134-160) on Fiducial
Intervals, its revised version by Stuart and Ord (1991) has only a few pages
on this topic. For the fiducial argument we refer to Kendall
and Stuart Vol. II (1967) and references contained therein as well as the
articles by Edwards (1983), Buehler (1983) and Stone (1983). For Bayesian
inference we refer to some of the old texts such as Lindley (1965), Box and
Tiao (1973) as well as some recent ones namely Berger (1985), O'Hagan
(1994) and Bernardo and Smith (1994).



References 


Abramowitz M. and Stegun I. A. (1968). Handbook of Mathematical Functions. National 
Bureau of Standards. Appl. Math. Ser. 55, Washington. 

Apostol, T.M. (1964). Mathematical Analysis (4th Ed.). Addison-Wesley Publishing Co. 
Inc. London, pp. 143-148. 

Arnold, B.C. (1968). Parameter estimation for a multivariate exponential distribution. 
JourAmer. Stat. Assn., 63, 848-852. 

Barnett, V and Lewis, T. (1984). Outliers in Statistical Data. 2nd Ed. John Wiley & Sons, 
New York. 

Bartlett, M.S. (1947). The use of transformations. Biometrics, 3, 39-52. 
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Phil.
Trans. Roy. Soc. 53, (Reproduced in Bayesian Statistics by S. James Press (1989).
John Wiley and Sons, New York).

Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd Edition, 
Springer Verlag, New York, Berlin, Heidelberg, Tokyo. 

Bernardo, J.M. and Smith A.F.M. (1994). Bayesian Theory. John Wiley & Sons, 
New York. 

Buehler, R.J. (1983). Fiducial Inference, Encyclopedia of Statistical Sciences, Vol. 3, pp.
76-81. Ed. Johnson N.L. and Kotz, S., John Wiley & Sons, New York.
Bhattacharya, G.K. and Johnson, R.A. (1973). On test of independence in bivariate 
exponential distribution. Jour. Amer. Stat. Assn. 68, 704-706. 

Birch, M.W. (1964). A new proof of the Pearson Fisher Theorem. Ann. Math. Stat. 35, 
92—98. 

Boole, G. (1854). An Investigation of the Laws of Thought on which are founded the
Mathematical Theories of Logic and Probabilities. Dover Publications (1951) New
York.

Box, G.E.P. and Tiao, G.C. (1973). Bayesian Inference in Statistical Analysis, Addison- 
Wesley, Reading, MA USA. 

Box Joan Fisher (1978). R.A. Fisher Life of a Scientist. John Wiley and Sons, New York 
Chanda, K.C. (1954). A note on the consistency and maxima of the roots of likelihood 
equations. Biometrika, 41, 56-61. 

Chandrasekar, B. (1983). Contributions to the theory of unbiased estimation functions 
Ph. D. Thesis, University of Pune, Pune-411007. 

Cochran, W.G. (1934). The distribution of quadratic forms in a normal system. Proc.
Camb. Phil. Soc., 30, 178-191.

Cramer, H. (1946). Mathematical Methods of Statistics. Princeton University Press, Princeton
N.J. 11th printing 1966.

Cohen, A.C. (1960). Estimating the parameter in a conditional Poisson Distribution 
Biometrics, 16, 203-211. 

Crichton, M. (1990). Jurassic Park. (pp. 160-166). Ballantine Books, New York 
David, H.A. (1981). Order Statistics. 2nd Edition, John Wiley and Sons, New York 
Desu, M.M. and Raghavrao, D. (1990). Sample Size Methodology. Academic Press, San 





Draper, N.R. and Smith, H. (1981). Applied Regression Analysis. 2nd Edition, John Wiley 

Edwards, A.W.F. (1983). Fiducial Distributions. Encyclopedia of Statistical Sciences,
Vol. 3, pp. 70-76, Johnson N.L. and Kotz S., John Wiley and Sons, New York.
Elderton, W.P. (1954). Frequency Curves and Correlation. 4th Edition, Cambridge University
Press, Cambridge.

Feller, W. (1957). An Introduction to Probability Theory and its Applications. Vol. 1 (2nd 

Edition). Ch. 6 Sec. 7, John Wiley and Sons, New York. 

Ferguson, T.S. (1967). Mathematical Statistics: A Decision Theoretic Approach. Academic 

Press, New York. 

Fisher, R.A. (1912). On an absolute Criterion for Fitting Frequency Curves. Messenger of
Mathematics, 41, 155-160.

Fisher, R.A. (1920). A mathematical examination of the method of determining the accuracy 
of an observation by the mean error and by the mean square error. Monthly Notices 
Roy. Astronomical Soc., 80, 757-770. 

Fisher, R.A. (1922). On the mathematical foundations of theoretical statistics. Phil. Trans. 


Roy. Soc. Ser. A, 222, 309-368. 

Fisher, R.A. (1925). Theory of Statistical Estimation. Proceedings of Cambridge Phil. 
Soc., 22, 700-725. 

Fisher, R.A. (1930). Inverse Probability. Proc. Camb. Phil. Soc. 26, 528-535.

Fisher, R.A. (1937). Professor Karl Pearson and the method of moments. Ann. Eug. 7, 
308-317. 


Fisher, R.A. (1954). Statistical Methods for Research Workers (13th Edition), Oliver and
Boyd Ltd., Edinburgh; pp. 299-320.

Fisher, R.A. (1959). Statistical Methods and Scientific Inference. 2nd Ed. Hafner Publishing 
Co. New York. 

Fisher, R.A. and Yates, F. (1963). Statistical Tables for Biological, Agricultural and 
Medical Research. 6th Edition, Oliver and Boyd Edinburgh. 

Fisz, M. (1963). Probability Theory and Mathematical Statistics. John Wiley and Sons, 
New York. 

Foutz, R.V. (1977). On the unique consistent solution of the likelihood equations. Jour. 
Amer. Stat. Assn. 72, 147-148. 

Freund, J.E. (1961). A bivariate extension of the exponential distribution. Jour. Amer. 
Stat. Assn. 56, 971-977. 

Guenther, W.C. (1969). Shortest confidence intervals. Amer. Stat. 23, 22-25. 

Guenther, W.C. (1971). Unbiased Confidence intervals. Amer. Stat. 25, 51-53. 

Hanagal, D.D. and Kale, B.K. (1992). Large sample tests for testing symmetry and 
independence in some Bivariate Exponential Models. Comm. Stat. Theory Meth. 21 (9), 
2625-2643. 


Huzurbazar, V.S. (1948). The likelihood equation, consistency and the maxima of the 
likelihood function. Ann. Eugenics, 14, 183-200. 

Huzurbazar, V.S. (1955). Confidence intervals for the parameter of a distribution admitting
a sufficient statistic when the range depends on the parameter. Jour. Roy. Stat. Soc. B,
17, 86-90.


Ibragimov, I.A. and Hasminski, R.S. (1981). Statistical Estimation, Asymptotic Theory,
Springer, New York, Heidelberg, Berlin, p. 91.

Kale, B.K. (1961). On the solution of the likelihood equation by iteration processes. 
Biometrika, 48, 452-456. 

Kale, B.K. (1962). On the solution of likelihood equations by iteration processes. The 
multiparametric case. Biometrika, 49, 479-486. 

Kale, B.K. and Chandrasekar, B. (1983). On the equivalence of optimality criteria for 
vector unbiased statistics. Jour. Ind. Stat. Assn. 21, 49-55. 




Kale, B.K. (1985). A note on the super efficient estimator. Jour. Stat. Plan. Inference,
259-263.

Kendall, M.G. and Stuart, A. (1958). Advanced Theory of Statistics, Vol. I, Charles Griffin
and Co. London.

Kendall, M.G. and Stuart, A. Advanced Theory of Statistics, Vol. II, Inference
and Relationship.

Kendall, M.G. and Plackett, R.L. (1977). Studies in the History of Statistics and Probability,
Vol. II, Charles Griffin, London.

Klebanov, L.B., Linnik, Yu.V. and Rukhin, A.L. (1971). Unbiased estimation and matrix
loss functions. Soviet Math. Dokl. 12, 1526-1528.

Kolmogorov, A.N. (1950). Unbiased estimates. Amer. Math. Soc. Translation No. 98.
Izvestia Akad. Nauk SSSR, Ser. Math. 14 (1950), 303-326.

Koshal, R.S. (1933). Application of the method of maximum likelihood to the improvement
of curves fitted by the method of moments. Jour. Roy. Stat. Soc. 96, 303-13.

Koshal, R.S. (1935). Application of the method of maximum likelihood to the derivation
of efficient statistics for fitting of frequency curves. Jour. Roy. Stat. Soc. 98, 128.

Kunte, S. and Gore, A.P. (1992). The paradox of large samples. Current Science, 62, 393-
395.

Laplace, P.S. Mémoire sur les probabilités, in Oeuvres Complètes de Laplace, 9,
393-485, Gauthier Villars, Paris.

Laplace, P.S. [(1812), (1820)]. Théorie analytique des probabilités, Courcier, Paris.

Laurent, A.G. (1963). Conditional distribution of order statistics and distribution of the
reduced i-th order statistic of the exponential model. Ann. Math. Stat. 34, 652-657.

LeCam, L. (1953). On some asymptotic properties of maximum likelihood estimates and
related Bayes estimates. Univ. of California Publications in Stat. 1, 277-330.

Lehmann, E.L. (1959). Testing of Statistical Hypotheses, John Wiley & Sons, New York.

Lehmann, E.L. and Scheffe, H. (1950). Completeness, similar regions and unbiased
estimation. Sankhya, 10, 305-340.

Lindley, D.V. (1965). Introduction to Probability and Statistics from a Bayesian View
Point. (Parts 1 and 2). Cambridge University Press, Cambridge.

Marshall, A.W. and Olkin, I. (1967). A multivariate exponential distribution. Jour. Amer.
Stat. Assn. 62, 30-44.

Neyman, J. and Pearson, E.S. (1933). On the problem of the most efficient tests of 
statistical hypotheses. Phil. Trans. Roy. Soc. A-231, 289. 

O'Hagan, A. (1994). Kendall's Advanced Theory of Statistics. Vol. 2B, Bayesian Inference,
John Wiley, New York.

Pairman, E. (1919). Tables of Digamma and Trigamma Functions, Tracts for Computers 
1. Ed. K. Pearson, Cambridge University Press. 

Paranjape, S. A. (1994). Moment Estimation of Weibull Shape parameter: Which moment? 
(private communication) 

Pearson, E.S. (1978). Lectures by K. Pearson on "The History of Statistics in 17th and 18th
centuries". Charles Griffin, London.

Pearson, K. (1900). On the criterion that a given system of deviations from the probable
in the case of a correlated system of variables is such that it can be reasonably
supposed to have arisen from random sampling. Phil. Mag. Series 5, Vol. L, pp. 157-175.

Pearson, K. (1936). Method of moments and method of maximum likelihood. Biometrika 
28, 34-59. 

Pearson, E.S., Hartley, H.O. (1970). Biometrika Tables for Statisticians, Vol. I, Cambridge 
University Press, Cambridge. 

Rao, C.R. (1947). Large sample tests of statistical hypotheses concerning several parameters
with applications to problem of estimation. Proc. Camb. Phil. Soc. 44, 50-57.




Rao, C.R. (1962). Apparent anomalies and irregularities in maximum likelihood estimation
(with discussion), Sankhya B, 24, 73-102.

Rao, C.R. (1962). Efficient estimates and optimum inference procedures in large samples 
(with discussion) Jour. Roy. Stat. Soc. B, 24, 46-72. 

Rao, C.R. (1963). Criterion of estimation in large samples. Sankhya A, 25, 189-206. 

Rao, C.R. (1973). Linear Statistical Inference, (2nd Edition), Wiley Eastern, New Delhi. 

Rao, C.R. (1992). Statistics as a last resort, In Glimpses of India’s Statistical Heritage 
Editors. Ghosh, J.K., Mitra, S.K. and Parthasarathy, K.R., Wiley Eastern, New Delhi 
pp. 153-213. 

Rohatgi, V.K. (1976). An Introduction to Probability Theory and Mathematical Statistics. 
John Wiley and Sons. New York. 

Rudin, Walter (1964). Principles of Mathematical Analysis, McGraw-Hill Book Co. New 
York. 


Savage, L.J. (1954). The Foundation of Statistics, John Wiley and Sons, New York. 

Serfling, R.J. (1980). Approximation Theorems of Mathematical Statistics. John Wiley 
and Sons, New York. 

Shewhart, W.A. (1950). Contributions to Mathematical Statistics - R.A. Fisher, John Wiley
and Sons, New York.

Stone, M. (1983). Fiducial Probability. Encyclopedia of Statistical Sciences, Vol. 3, pp.
81-86, Ed. Johnson N.L. and Kotz S., John Wiley & Sons, New York.

Stuart, A. and Ord, J.K. (1991). Kendall's Advanced Theory of Statistics, Vol. 2, Fifth
Edition, Edward Arnold, a division of Hodder and Stoughton, London.

Student (1908). The probable error of a mean. Biometrika, 6, 1-25.

Tarone, R.E. and Gruenhage, G. (1975). A note on the uniqueness of roots of the likelihood
equations for vector valued parameters. Jour. Amer. Stat. Assn. 70, 903-904.

Tate, R.F. and Klett, G.W. (1959). Optimum confidence intervals for the variance of a
normal distribution. Jour. Amer. Stat. Assn., 54, 674-682.

Venn, J. (1866). The Logic of Chance. An Essay on the Foundations and Province of the
Theory of Probability with Special Reference to its Logical Bearings and its Applications
to Moral and Social Sciences. 3rd Ed. McMillan.

Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when
the number of observations is large. Trans. Amer. Math. Soc. 54, 426-482.

Wald, A. (1950). Sequential Analysis. John Wiley & Sons, New York.

Wilks, S.S. (1938). The large sample distribution of the likelihood ratio for testing composite
hypotheses. Ann. Math. Stat. 9, 60-62.

Wolfowitz, J. (1965). Asymptotic efficiency of maximum likelihood estimators. Theory
of Prob. and its Appl. 10, 247-260.



Index 


Ancillary Statistic 71 
Arbuthnot 174 
Asymptotic Relative 
Efficiency (ARE) 115, 169 

Basu’s Theorem 71 
Bayesian Approach 237 
Bayesian Intervals 257, 260 
Bayes Test 236 
Bayes Theorem 236 
Best Linear Unbiased
Estimator (BLUE) 47
Best Critical Region 180 
Boscovitch 6, 8, 44, 134 

Chandrasekar B. 78, 88 
Completeness 60 
Confidence Intervals 

Shortest Expected Length 241 
Asymptotic 246 
Studentized 248 
Unbiased 253 
Consistent Estimators 
Real valued 90 
Vector valued 91 
Marginally 92 
Jointly 92 
Asymptotically 
Normal (CAN) 115 
Cramer Family 148, 151 
Cramer-Huzurbazar 
Theorem 152 

Cramer-Rao Inequality 50, 83 

De-Moivre-Laplace Limit Theorem 8 
Determinant Optimality 73 
Differentiation Under Integral Sign 19 

Eddington A.S. 6

Ellipsoid of Concentration 75 

Error Distributions 8 

Errors of Type I/II 180 

Estimate 4 

Estimator 4 

Estimability 45 


Euler 8, 137 
Exponential Family 

One-parameter 34, 122, 144
Multiparameter 37, 129, 130, 145 

Fiducial Intervals 261 

Fiducial Probability 262 

Fisher Information 10, 11, 15 

Fisher Information Matrix 21 

Fisher R.A. 6, 7, 11, 15, 21, 100, 174, 237 

Fisher-Newton-Raphson Procedure 167 

Gauss 6, 8, 44, 134, 137 

Gauss Unbiasedness 47 

Geiger Counter Data 2 

Geometric Interpretation of Optimality
Criteria 76

Hodges LeCam Estimator 170 
Huzurbazar V.S. 43, 153 

Indexing Parameter 1 
Invariance of Consistent 
Estimator 91, 92 
Invariance of Consistent 
Asymptotically Normal Estimators 116 

Iterative Procedure for Solving 
Likelihood Equations 167, 168

Kale B.K. 88, 167, 168, 171 
Klebanov-Linnik-Rukhin 
Theorem 76 

Kolmogorov A.N. 82, 258 
Kullback Leibler Information 153 

Labelling Parameter 1 
Laplace 6, 8, 44, 237 
Largest Eigenvalue Optimality 73 
Least Square 6, 8 
Least Absolute Deviation 6 
Legendre 6, 44, 134 
Lehmann-Scheffe Theorem 63 
Likelihood 
Function 9 
ordering 9 





equivalence 29 
Ratio test 205 

Magnitudes of Interest 5 
Matrix Optimality 73 
Mean Squared Error (MSE)
Criterion 7, 105

Method of Moments 9, 96, 119, 127 
Method of Maximum Likelihood 9, 139 
Method of Percentiles 100, 101, 120, 131 
Minimum Sample Size 93, 231 
Minimal Sufficient Statistic 33 
Minimum Variance Unbiased 
Estimator (MVUE) 50 
Minimum Variance Bound 
Unbiased Estimator (MVBUE) 52 
Monotone Likelihood Ratio (MLR) 200 
Most Powerful Test 181 

Newton-Raphson Procedure 167, 168 
Neyman Factorizability Criterion 26 
Neyman-Pearson Lemma 182 
Non-identifiability Problem 3 

Observed Level of 
Significance 217 

Pearson, K. 2, 6, 99 
Pearson System 9 
Pearson Chi-square 172, 213 


Pearson-Fisher Theorem 215 
Pitman Family 40 
Pivotal Quantity 240 
Asymptotic 245 
Studentized 248 

Rao-Blackwell Theorem 54, 55 
Rao-Blackwell Lehmann-Scheffe
Theorem 65
Rao’s Score Test 221 

Significance Tests 
Level 175, 180 
Logic 174 
Size 180 

Tchebychev Inequality 44 

Test Function 181 

Test Statistic 175 

Type I and Type II Errors 178 

Convex Combination of Errors 234 

Unbiased Estimator 45 
Linear (BLUE) 47 
Non-linear 49 
Of Zero 60 
Vector Valued 72 

Wald 220, 234 








A FIRST COURSE ON


PARAMETRIC INFERENCE 


B.K. Kale 

Department of Statistics 
University of Pune

Pune - 411 007, India 


The book is intended to be used as a text for an introductory course on 
model based parametric inference in senior undergraduate (honours) or in 
the first year graduate program. Starting with the basic concept of
sufficient statistic, the classical approach based on minimum variance 
unbiased estimation is presented in detail. There is a separate chapter on 
simultaneous estimation of several parameters. Large sample theory of 
estimation based on Consistent Asymptotic Normal (CAN) estimators 
obtained by method of moments, percentile and the method of maximum 
likelihood is also introduced. The book concludes with the classical 
Neyman-Pearson theory of tests of ...
with emphasis on likelihood ratio ...
maximum likelihood estimators.

There are many solved examples ...
applicable in a variety of practical s...
... the use of well-known classical data ...









































